Gaussian Process Optimization in the Bandit Setting: 
No Regret and Experimental Design 
Abstract

Many applications require optimizing an un- 
known, noisy function that is expensive to 
evaluate. We formalize this task as a multi- 
armed bandit problem, where the payoff function 
is either sampled from a Gaussian process (GP) 
or has low RKHS norm. We resolve the impor- 
tant open problem of deriving regret bounds for 
this setting, which imply novel convergence rates 
for GP optimization. We analyze GP-UCB, an 
intuitive upper-confidence based algorithm, and 
bound its cumulative regret in terms of maximal 
information gain, establishing a novel connection 
between GP optimization and experimental de- 
sign. Moreover, by bounding the latter in terms 
of operator spectra, we obtain explicit sublinear 
regret bounds for many commonly used covari- 
ance functions. In some important cases, our 
bounds have surprisingly weak dependence on 
the dimensionality. In our experiments on real 
sensor data, GP-UCB compares favorably with 
other heuristical GP optimization approaches. 

1. Introduction 

In most stochastic optimization settings, evaluating
the unknown function is expensive, and sampling
is to be minimized. Examples include choosing
advertisements in sponsored search to maximize
profit in a click-through model (Pandey & Olston,
2007) or learning optimal control strategies for robots
(Lizotte et al., 2007). Predominant approaches
to this problem include the multi-armed bandit
paradigm (Robbins, 1952), where the goal is to
maximize cumulative reward by optimally balancing
exploration and exploitation, and experimental design
(Chaloner & Verdinelli, 1995), where the function
is to be explored globally with as few evaluations
as possible, for example by maximizing information
gain. The challenge in both approaches is twofold: we
have to estimate an unknown function $f$ from noisy
samples, and we must optimize our estimate over some
high-dimensional input space. For the former, much
progress has been made in machine learning through
kernel methods and Gaussian process (GP) models
(Rasmussen & Williams, 2006), where smoothness
assumptions about $f$ are encoded through the choice
of kernel in a flexible nonparametric fashion. Beyond
Euclidean spaces, kernels can be defined on diverse
domains such as spaces of graphs, sets, or lists.

¹ This is the longer version of our paper in ICML 2010;
see Srinivas et al. (2010).

We are concerned with GP optimization in the multi-armed
bandit setting, where $f$ is sampled from a GP
distribution or has low "complexity" measured in
terms of its RKHS norm under some kernel. We provide
the first sublinear regret bounds in this nonparametric
setting, which imply convergence rates for GP
optimization. In particular, we analyze the Gaussian
Process Upper Confidence Bound (GP-UCB) algorithm,
a simple and intuitive Bayesian method (Auer
et al., 2002; Auer, 2002; Dani et al., 2008). While
objectives are different in the multi-armed bandit
and experimental design paradigm, our results draw
a close technical connection between them: our regret
bounds come in terms of an information gain quantity,
measuring how fast $f$ can be learned in an information
theoretic sense. The submodularity of this function
allows us to prove sharp regret bounds for particular
covariance functions, which we demonstrate for commonly
used Squared Exponential and Matérn kernels.

Related Work. Our work generalizes stochastic
linear optimization in a bandit setting, where the
unknown function comes from a finite-dimensional
linear space. GPs are nonlinear random functions,
which can be represented in an infinite-dimensional
linear space. For the standard linear setting, Dani
et al. (2008) provide a near-complete characterization
(also see Auer 2002; Dani et al. 2007; Abernethy et al.
2008; Rusmevichientong & Tsitsiklis 2008), explicitly
dependent on the dimensionality. In the GP setting,
the challenge is to characterize complexity in a different
manner, through properties of the kernel function.
Our technical contributions are twofold: first, we
show how to analyze the nonlinear setting by focusing
on the concept of information gain, and second, we
explicitly bound this information gain measure using
the concept of submodularity (Nemhauser et al.,
1978) and knowledge about kernel operator spectra.

Kleinberg et al. (2008) provide regret bounds under
weaker and less configurable assumptions (only
Lipschitz-continuity w.r.t. a metric is assumed;
Bubeck et al. 2008 consider arbitrary topological
spaces), which however degrade rapidly with the
dimensionality of the problem ($\Omega(T^{\frac{d+1}{d+2}})$). In practice,
linearity w.r.t. a fixed basis is often too stringent
an assumption, while Lipschitz-continuity can be too
coarse-grained, leading to poor rate bounds. Adopting
GP assumptions, we can model levels of smoothness in
a fine-grained way. For example, our rates for the
frequently used Squared Exponential kernel, enforcing a
high degree of smoothness, have weak dependence on
the dimensionality: $O(\sqrt{T (\log T)^{d+1}})$ (see Fig. 1).

There is a large literature on GP (response surface)
optimization. Several heuristics for trading off exploration
and exploitation in GP optimization have been
proposed (such as Expected Improvement, Mockus
et al. 1978, and Most Probable Improvement, Mockus
1989) and successfully applied in practice (c.f., Lizotte
et al. 2007). Brochu et al. (2009) provide a comprehensive
review of and motivation for Bayesian optimization
using GPs. The Efficient Global Optimization
(EGO) algorithm for optimizing expensive black-box
functions is proposed by Jones et al. (1998) and extended
to GPs by Huang et al. (2006). Little is known
about theoretical performance of GP optimization.
While convergence of EGO is established by Vazquez
& Bect (2007), convergence rates have remained elusive.
Grünewalder et al. (2010) consider the pure exploration
problem for GPs, where the goal is to find the
optimal decision over $T$ rounds, rather than maximize
cumulative reward (with no exploration/exploitation
dilemma). They provide sharp bounds for this exploration
problem. Note that this methodology would not
lead to bounds for minimizing the cumulative regret.
Our cumulative regret bounds translate to the first
performance guarantees (rates) for GP optimization.

Figure 1. Our regret bounds (up to polylog factors) for linear,
radial basis, and Matérn kernels; $d$ is the dimension,
$T$ is the time horizon, and $\nu$ is a Matérn parameter.

Summary. Our main contributions are:

• We analyze GP-UCB, an intuitive algorithm for
GP optimization, when the function is either sampled
from a known GP, or has low RKHS norm.

• We bound the cumulative regret for GP-UCB in 
terms of the information gain due to sampling, 
establishing a novel connection between experi- 
mental design and GP optimization. 

• By bounding the information gain for popular 
classes of kernels, we establish sublinear regret 
bounds for GP optimization for the first time. 
Our bounds depend on kernel choice and param- 
eters in a fine-grained fashion. 

• We evaluate GP-UCB on sensor network data, 
demonstrating that it compares favorably to ex- 
isting algorithms for GP optimization. 

2. Problem Statement and Background 

Consider the problem of sequentially optimizing an unknown
reward function $f : D \to \mathbb{R}$: in each round $t$, we
choose a point $x_t \in D$ and get to see the function value
there, perturbed by noise: $y_t = f(x_t) + \epsilon_t$. Our goal is
to maximize the sum of rewards $\sum_{t=1}^{T} f(x_t)$, thus to
perform essentially as well as $x^* = \operatorname{argmax}_{x \in D} f(x)$
(as rapidly as possible). For example, we might want
to find locations of highest temperature in a building
by sequentially activating sensors in a spatial network
and regressing on their measurements. $D$ consists of
all sensor locations, $f(x)$ is the temperature at $x$, and
sensor accuracy is quantified by the noise variance.
Each activation draws battery power, so we want to
sample from as few sensors as possible.

Regret. A natural performance metric in this context
is cumulative regret, the loss in reward due to not
knowing $f$'s maximum points beforehand. Suppose
the unknown function is $f$, its maximum point²
$x^* = \operatorname{argmax}_{x \in D} f(x)$. For our choice $x_t$ in round
$t$, we incur instantaneous regret $r_t = f(x^*) - f(x_t)$.
The cumulative regret $R_T$ after $T$ rounds is the sum
of instantaneous regrets: $R_T = \sum_{t=1}^{T} r_t$. A desirable
asymptotic property of an algorithm is to be no-regret:
$\lim_{T \to \infty} R_T / T = 0$. Note that neither $r_t$ nor $R_T$ is
ever revealed to the algorithm. Bounds on the average
regret $R_T / T$ translate to convergence rates for GP
optimization: the maximum $\max_{t \le T} f(x_t)$ in the first
$T$ rounds is no further from $f(x^*)$ than the average.
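The last claim follows in one line, because the smallest instantaneous regret cannot exceed the average one. Spelled out with the quantities defined in this paragraph:

```latex
f(x^*) - \max_{t \le T} f(x_t)
  = \min_{t \le T} \bigl( f(x^*) - f(x_t) \bigr)
  = \min_{t \le T} r_t
  \le \frac{1}{T} \sum_{t=1}^{T} r_t
  = \frac{R_T}{T}.
```

Hence any bound on $R_T / T$ that tends to zero is also a convergence rate for the best point sampled so far.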



² $x^*$ need not be unique; only $f(x^*)$ occurs in the regret.



2.1. Gaussian Processes and RKHS's 

Gaussian Processes. Some assumptions on $f$ are
required to guarantee no-regret. While rigid parametric
assumptions such as linearity may not hold in practice,
a certain degree of smoothness is often warranted.
In our sensor network, temperature readings at closeby
locations are highly correlated (see Figure 2(a)). We
can enforce implicit properties like smoothness without
relying on any parametric assumptions, modeling
$f$ as a sample from a Gaussian process (GP): a collection
of dependent random variables, one for each
$x \in D$, every finite subset of which is multivariate
Gaussian distributed in an overall consistent way (Rasmussen
& Williams, 2006). A $GP(\mu(x), k(x, x'))$ is
specified by its mean function $\mu(x) = \mathbb{E}[f(x)]$ and
covariance (or kernel) function $k(x, x') = \mathbb{E}[(f(x) - \mu(x))(f(x') - \mu(x'))]$.
For GPs not conditioned on
data, we assume³ that $\mu = 0$. Moreover, we restrict
$k(x, x) \le 1$, $x \in D$, i.e., we assume bounded variance.
By fixing the correlation behavior, the covariance function
$k$ encodes smoothness properties of sample functions
$f$ drawn from the GP. A range of commonly used
kernel functions is given in Section 5.2.

In this work, GPs play multiple roles. First, some of
our results hold when the unknown target function is a
sample from a known GP distribution $GP(0, k(x, x'))$.
Second, the Bayesian algorithm we analyze generally
uses $GP(0, k(x, x'))$ as prior distribution over $f$. A
major advantage of working with GPs is the existence
of simple analytic formulae for mean and covariance
of the posterior distribution, which allows
easy implementation of algorithms. For a noisy sample
$\mathbf{y}_T = [y_1 \dots y_T]^\top$ at points $A_T = \{x_1, \dots, x_T\}$,
$y_t = f(x_t) + \epsilon_t$ with $\epsilon_t \sim N(0, \sigma^2)$ i.i.d. Gaussian noise,
the posterior over $f$ is a GP distribution again, with
mean $\mu_T(x)$, covariance $k_T(x, x')$ and variance $\sigma_T^2(x)$:

$$\mu_T(x) = \mathbf{k}_T(x)^\top (\mathbf{K}_T + \sigma^2 \mathbf{I})^{-1} \mathbf{y}_T, \tag{1}$$

$$k_T(x, x') = k(x, x') - \mathbf{k}_T(x)^\top (\mathbf{K}_T + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_T(x'), \qquad \sigma_T^2(x) = k_T(x, x), \tag{2}$$

where $\mathbf{k}_T(x) = [k(x_1, x) \dots k(x_T, x)]^\top$ and $\mathbf{K}_T$ is
the positive definite kernel matrix $[k(x, x')]_{x, x' \in A_T}$.
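The posterior formulae (1) and (2) translate almost directly into code. The following is a minimal NumPy sketch, not the authors' implementation: the function names `se_kernel` and `gp_posterior`, the Squared Exponential kernel, and the toy data are all assumptions chosen for illustration, and a Cholesky factorization is used in place of an explicit matrix inverse.

```python
import numpy as np

def se_kernel(X1, X2, lengthscale=0.2):
    """Squared Exponential kernel exp(-||x - x'||^2 / (2 l^2)) (illustrative choice)."""
    sq = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def gp_posterior(X_obs, y_obs, X_query, kernel, noise_var):
    """Posterior mean (Eq. 1) and variance (Eq. 2) at the rows of X_query."""
    K = kernel(X_obs, X_obs)                    # K_T
    k_q = kernel(X_obs, X_query)                # column j is k_T(x_j)
    A = K + noise_var * np.eye(len(X_obs))      # K_T + sigma^2 I
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))   # (K_T + sigma^2 I)^{-1} y_T
    V = np.linalg.solve(L, k_q)                 # L^{-1} k_T(x)
    mean = k_q.T @ alpha
    prior_var = np.diag(kernel(X_query, X_query))
    var = prior_var - np.sum(V**2, axis=0)      # k(x, x) - k_T(x)^T A^{-1} k_T(x)
    return mean, np.maximum(var, 0.0)           # clip tiny negative rounding errors

# Toy data (hypothetical values, only to exercise the functions)
X_obs = np.array([[0.1], [0.4], [0.9]])
y_obs = np.array([0.5, -0.2, 0.3])
X_query = np.linspace(0.0, 1.0, 5)[:, None]
mean, var = gp_posterior(X_obs, y_obs, X_query, se_kernel, noise_var=0.025)
```

Each call costs $O(T^3)$ for the factorization; a practical implementation would update the factor incrementally as samples arrive.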

RKHS. Instead of the Bayes case, where $f$ is sampled
from a GP prior, we also consider the more agnostic
case where $f$ has low "complexity" as measured
under an RKHS norm (and distribution free assumptions
on the noise process). The notion of reproducing
kernel Hilbert spaces (RKHS, Wahba 1990) is intimately
related to GPs and their covariance functions
$k(x, x')$. The RKHS $\mathcal{H}_k(D)$ is a complete subspace
of $L_2(D)$ of nicely behaved functions, with an
inner product $\langle \cdot, \cdot \rangle_k$ obeying the reproducing property:
$\langle f, k(x, \cdot) \rangle_k = f(x)$ for all $f \in \mathcal{H}_k(D)$. It is literally
constructed by completing the set of mean functions
$\mu_T$ for all possible $T$, $\{x_t\}$, and $\mathbf{y}_T$. The induced
RKHS norm $\|f\|_k = \sqrt{\langle f, f \rangle_k}$ measures smoothness of
$f$ w.r.t. $k$: in much the same way as $k_1$ would generate
smoother samples than $k_2$ as GP covariance functions,
$\|\cdot\|_{k_1}$ assigns larger penalties than $\|\cdot\|_{k_2}$. $\langle \cdot, \cdot \rangle_k$ can be
extended to all of $L_2(D)$, in which case $\|f\|_k < \infty$ iff
$f \in \mathcal{H}_k(D)$. For most kernels discussed in Section 5.2,
members of $\mathcal{H}_k(D)$ can uniformly approximate any
continuous function on any compact subset of $D$.

³ The zero-mean assumption is w.l.o.g. (Rasmussen & Williams, 2006).

2.2. Information Gain & Experimental Design 

One approach to maximizing $f$ is to first choose
points $x_t$ so as to estimate the function globally
well, then play the maximum point of our estimate.
How can we learn about $f$ as rapidly as possible?
This question comes down to Bayesian Experimental
Design (henceforth "ED"; see Chaloner & Verdinelli
1995), where the informativeness of a set of sampling
points $A \subset D$ about $f$ is measured by the information
gain (c.f., Cover & Thomas 1991), which is the mutual
information between $f$ and observations $\mathbf{y}_A = \mathbf{f}_A + \boldsymbol{\epsilon}_A$
at these points:

$$I(\mathbf{y}_A; f) = H(\mathbf{y}_A) - H(\mathbf{y}_A \mid f), \tag{3}$$

quantifying the reduction in uncertainty about $f$
from revealing $\mathbf{y}_A$. Here, $\mathbf{f}_A = [f(x)]_{x \in A}$ and
$\boldsymbol{\epsilon}_A \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$. For a Gaussian,
$H(N(\boldsymbol{\mu}, \boldsymbol{\Sigma})) = \frac{1}{2} \log |2 \pi e \boldsymbol{\Sigma}|$, so that in our setting
$I(\mathbf{y}_A; f) = I(\mathbf{y}_A; \mathbf{f}_A) = \frac{1}{2} \log |\mathbf{I} + \sigma^{-2} \mathbf{K}_A|$, where
$\mathbf{K}_A = [k(x, x')]_{x, x' \in A}$. While finding the information gain
maximizer among $A \subset D$, $|A| \le T$ is NP-hard (Ko
et al., 1995), it can be approximated by an efficient
greedy algorithm. If $F(A) = I(\mathbf{y}_A; f)$, this algorithm
picks $x_t = \operatorname{argmax}_{x \in D} F(A_{t-1} \cup \{x\})$ in round $t$, which
can be shown to be equivalent to

$$x_t = \operatorname{argmax}_{x \in D} \sigma_{t-1}(x), \tag{4}$$

where $A_{t-1} = \{x_1, \dots, x_{t-1}\}$. Importantly, this
simple algorithm is guaranteed to find a near-optimal
solution: for the set $A_T$ obtained after $T$ rounds, we
have that

$$F(A_T) \ge (1 - 1/e) \max_{|A| \le T} F(A), \tag{5}$$

at least a constant fraction of the optimal information
gain value. This is because $F(A)$ satisfies
a diminishing returns property called submodularity
(Krause & Guestrin, 2005), and the greedy approximation
guarantee (5) holds for any submodular function
(Nemhauser et al., 1978).
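The equivalence between greedy information-gain maximization and the maximum-variance rule (4) rests on the identity $F(A \cup \{x\}) - F(A) = \frac{1}{2} \log(1 + \sigma^{-2} \sigma_{t-1}^2(x))$, which is monotone in the posterior variance. The sketch below checks this identity numerically on a finite decision set; the grid, the Squared Exponential kernel, the noise level, and all function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def se_kernel_matrix(points, lengthscale=0.2):
    """Kernel matrix of a 1-D Squared Exponential kernel (illustrative choice)."""
    d = points[:, None] - points[None, :]
    return np.exp(-d**2 / (2.0 * lengthscale**2))

def info_gain(K, idx, noise_var):
    """F(A) = 0.5 * log|I + sigma^{-2} K_A| for the index set A = idx."""
    if len(idx) == 0:
        return 0.0
    K_A = K[np.ix_(idx, idx)]
    _, logdet = np.linalg.slogdet(np.eye(len(idx)) + K_A / noise_var)
    return 0.5 * logdet

def posterior_variance(K, idx, noise_var):
    """Posterior variance at every point of D, given samples at indices idx."""
    if len(idx) == 0:
        return np.diag(K).copy()
    K_A = K[np.ix_(idx, idx)] + noise_var * np.eye(len(idx))
    k_A = K[idx, :]
    return np.diag(K) - np.sum(k_A * np.linalg.solve(K_A, k_A), axis=0)

def greedy_ed(K, T, noise_var):
    """Greedy ED rule (Eq. 4): repeatedly pick the point of largest posterior variance."""
    chosen = []
    for _ in range(T):
        chosen.append(int(np.argmax(posterior_variance(K, chosen, noise_var))))
    return chosen

D = np.linspace(0.0, 1.0, 50)
K = se_kernel_matrix(D)
noise_var = 0.025
A = greedy_ed(K, 5, noise_var)

# Marginal gain of adding each candidate x to A, computed from the definition of F
gains = np.array([info_gain(K, A + [j], noise_var) for j in range(len(D))])
marginal = gains - info_gain(K, A, noise_var)
# The same quantity from the posterior variance alone
var = posterior_variance(K, A, noise_var)
from_variance = 0.5 * np.log1p(var / noise_var)
```

Note that `greedy_ed` never looks at observed values, which is the point made in the text: the ED rule can be run before any measurement is taken.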

While sequentially optimizing Eq. 4 is a provably good
way to explore $f$ globally, it is not well suited for function
optimization. For the latter, we only need to identify
points $x$ where $f(x)$ is large, in order to concentrate
sampling there as rapidly as possible, thus exploit
our knowledge about maxima. In fact, the ED rule
(4) does not even depend on observations $y_t$ obtained
along the way. Nevertheless, the maximum information
gain after $T$ rounds will play a prominent role
in our regret bounds, forging an important connection
between GP optimization and experimental design.

3. GP-UCB Algorithm 

For sequential optimization, the ED rule (4) can be
wasteful: it aims at decreasing uncertainty globally,
not just where maxima might be. Another idea is to
pick points as $x_t = \operatorname{argmax}_{x \in D} \mu_{t-1}(x)$, maximizing
the expected reward based on the posterior so far.
However, this rule is too greedy too soon and tends
to get stuck in shallow local optima. A combined
strategy is to choose

$$x_t = \operatorname{argmax}_{x \in D} \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x), \tag{6}$$

where $\beta_t$ are appropriate constants. This latter objective
prefers both points $x$ where $f$ is uncertain (large
$\sigma_{t-1}(\cdot)$) and such where we expect to achieve high
rewards (large $\mu_{t-1}(\cdot)$): it implicitly negotiates the
exploration-exploitation tradeoff. A natural interpretation
of this sampling rule is that it greedily selects
points $x$ such that $f(x)$ should be a reasonable upper
bound on $f(x^*)$, since the argument in (6) is an upper
quantile of the marginal posterior $P(f(x) \mid \mathbf{y}_{t-1})$. We
call this choice the Gaussian process upper confidence
bound rule (GP-UCB), where $\beta_t$ is specified depending
on the context (see Section 4). Pseudocode for
the GP-UCB algorithm is provided in Algorithm 1.
Figure 2 illustrates two subsequent iterations, where
GP-UCB both explores (Figure 2(b)) by sampling an
input $x$ with large $\sigma_{t-1}^2(x)$ and exploits (Figure 2(c))
by sampling $x$ with large $\mu_{t-1}(x)$.

The GP-UCB selection rule Eq. 6 is motivated by the
UCB algorithm for the classical multi-armed bandit
problem (Auer et al., 2002; Kocsis & Szepesvári,
2006). Among competing criteria for GP optimization
(see Section 1), a variant of the GP-UCB rule has
been demonstrated to be effective for this application
(Dorard et al., 2009). To our knowledge, strong
theoretical results of the kind provided for GP-UCB in
this paper have not been given for any of these search
heuristics. In Section 6, we show that in practice
GP-UCB compares favorably with these alternatives.

If $D$ is infinite, finding $x_t$ in (6) may be hard: the
upper confidence index is multimodal in general.
However, global search heuristics are very effective in
practice (Brochu et al., 2009). It is generally assumed
that evaluating $f$ is more costly than maximizing the
UCB index.

Algorithm 1 The GP-UCB algorithm.

Input: Input space $D$; GP prior $\mu_0 = 0$, $\sigma_0$, $k$
for $t = 1, 2, \dots$ do
    Choose $x_t = \operatorname{argmax}_{x \in D} \mu_{t-1}(x) + \sqrt{\beta_t}\, \sigma_{t-1}(x)$
    Sample $y_t = f(x_t) + \epsilon_t$
    Perform Bayesian update to obtain $\mu_t$ and $\sigma_t$
end for
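Algorithm 1 can be sketched for a finite decision set as follows. This is a hedged illustration rather than the authors' code: the function name `gp_ucb`, the 200-point grid, and the horizon $T = 50$ are assumptions, while the lengthscale 0.2, noise variance 0.025, and $\delta = 0.1$ mirror the synthetic setup of Section 6, and $\beta_t$ uses the finite-$D$ schedule $2 \log(|D| t^2 \pi^2 / (6\delta))$ from Section 4. The posterior is recomputed from scratch each round for readability.

```python
import numpy as np

def gp_ucb(K, f, T, noise_var, delta, rng):
    """Run GP-UCB on a finite decision set with kernel matrix K and true values f."""
    n = len(f)
    chosen, ys = [], []
    mu, var = np.zeros(n), np.diag(K).copy()    # prior: mu_0 = 0, sigma_0^2(x) = k(x, x)
    for t in range(1, T + 1):
        beta_t = 2.0 * np.log(n * t**2 * np.pi**2 / (6.0 * delta))
        x_t = int(np.argmax(mu + np.sqrt(beta_t * var)))        # selection rule (6)
        y_t = f[x_t] + np.sqrt(noise_var) * rng.standard_normal()
        chosen.append(x_t)
        ys.append(y_t)
        # Bayesian update via Eqs. (1)-(2)
        K_A = K[np.ix_(chosen, chosen)] + noise_var * np.eye(t)
        k_A = K[chosen, :]
        sol = np.linalg.solve(K_A, k_A)          # (K_t + sigma^2 I)^{-1} k_t(x) for all x
        mu = sol.T @ np.array(ys)
        var = np.maximum(np.diag(K) - np.sum(k_A * sol, axis=0), 0.0)
    return np.array(chosen)

rng = np.random.default_rng(0)
D = np.linspace(0.0, 1.0, 200)
K = np.exp(-(D[:, None] - D[None, :])**2 / (2.0 * 0.2**2))
# Draw f from the GP prior; the small jitter only stabilises the Cholesky factor
f = np.linalg.cholesky(K + 1e-6 * np.eye(len(D))) @ rng.standard_normal(len(D))

chosen = gp_ucb(K, f, T=50, noise_var=0.025, delta=0.1, rng=rng)
cumulative_regret = np.cumsum(f.max() - f[chosen])   # R_t for t = 1, ..., T
```

Because the true `f` is known in this simulation, the cumulative regret can be computed after the fact; the algorithm itself only ever sees the noisy values `ys`.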

UCB algorithms (and GP optimization techniques
in general) have been applied to a large number of
problems in practice (Kocsis & Szepesvári, 2006;
Pandey & Olston, 2007; Lizotte et al., 2007). Their
performance is well characterized in both the finite
arm setting and the linear optimization setting, but
no convergence rates for GP optimization are known.

4. Regret Bounds 

We now establish cumulative regret bounds for GP
optimization, treating a number of different settings:
$f \sim GP(0, k(x, x'))$ for finite $D$, $f \sim GP(0, k(x, x'))$
for general compact $D$, and the agnostic case of arbitrary
$f$ with bounded RKHS norm.

GP optimization generalizes stochastic linear optimization,
where a function $f$ from a finite-dimensional
linear space is optimized over. For the linear case, Dani
et al. (2008) provide regret bounds that explicitly depend
on the dimensionality⁴ $d$. GPs can be seen as
random functions in some infinite-dimensional linear
space, so their results do not apply in this case. This
problem is circumvented in our regret bounds. The
quantity governing them is the maximum information
gain $\gamma_T$ after $T$ rounds, defined as:

$$\gamma_T := \max_{A \subset D : |A| = T} I(\mathbf{y}_A; \mathbf{f}_A), \tag{7}$$

where $I(\mathbf{y}_A; \mathbf{f}_A) = I(\mathbf{y}_A; f)$ is defined in (3). Recall
that $I(\mathbf{y}_A; \mathbf{f}_A) = \frac{1}{2} \log |\mathbf{I} + \sigma^{-2} \mathbf{K}_A|$, where
$\mathbf{K}_A = [k(x, x')]_{x, x' \in A}$ is the covariance matrix of
$\mathbf{f}_A = [f(x)]_{x \in A}$ associated with the samples $A$. Our regret
bounds are of the form $O^*(\sqrt{T \beta_T \gamma_T})$, where $\beta_T$ is the
confidence parameter in Algorithm 1, while the bounds
of Dani et al. (2008) are of the form $O^*(\sqrt{T \beta_T d})$ ($d$
the dimensionality of the linear function space). Here
and below, the $O^*$ notation is a variant of $O$, where
log factors are suppressed. While our proofs, all provided
in the Appendix, use techniques similar to those
of Dani et al. (2008), we face a number of additional
significant technical challenges. Besides avoiding the
finite-dimensional analysis, we must handle confidence
issues, which are more delicate for nonlinear random
functions.

⁴ In general, $d$ is the dimensionality of the input space
$D$, which in the finite-dimensional linear case coincides
with the feature space.

Figure 2. Panels: (a) Temperature data, (b) Iteration $t$, (c) Iteration $t + 1$.
(a) Example of temperature data collected by a network of 46 sensors at Intel Research Berkeley. (b, c) Two
iterations of the GP-UCB algorithm. It samples points that are either uncertain (b) or have high posterior mean (c).

Importantly, note that the information gain is a problem
dependent quantity: properties of both the kernel
and the input space will determine the growth of
regret. In Section 5, we provide general methods for
bounding $\gamma_T$, either by efficient auxiliary computations
or by direct expressions for specific kernels of
interest. Our results match known lower bounds (up
to log factors) in both the $K$-armed bandit and the
$d$-dimensional linear optimization case.

Bounds for a GP Prior. For finite $D$, we obtain
the following bound.

Theorem 1 Let $\delta \in (0, 1)$ and $\beta_t = 2 \log(|D| t^2 \pi^2 / 6\delta)$.
Running GP-UCB with $\beta_t$ for
a sample $f$ of a GP with mean function zero and
covariance function $k(x, x')$, we obtain a regret bound
of $O^*(\sqrt{T \gamma_T \log |D|})$ with high probability. Precisely,

$$\Pr\left\{ R_T \le \sqrt{C_1 T \beta_T \gamma_T} \quad \forall T \ge 1 \right\} \ge 1 - \delta,$$

where $C_1 = 8 / \log(1 + \sigma^{-2})$.

The proof methodology follows Dani et al. (2007) in
that we relate the regret to the growth of the log
volume of the confidence ellipsoid; a novelty in our
proof is showing how this growth is characterized by
the information gain.
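To get a feel for the constants in Theorem 1, the right-hand side $\sqrt{C_1 T \beta_T \gamma_T}$ can be evaluated numerically. The sketch below is only illustrative: the function names are hypothetical, and $\gamma_T$ is replaced by the stand-in $(\log T)^2$, which has the order of growth stated for the Squared Exponential kernel with $d = 1$ but ignores all constants.

```python
import numpy as np

def beta_finite(t, num_points, delta):
    """Theorem 1 schedule: beta_t = 2 log(|D| t^2 pi^2 / (6 delta))."""
    return 2.0 * np.log(num_points * t**2 * np.pi**2 / (6.0 * delta))

def regret_bound(T, gamma_T, num_points, delta, noise_var):
    """sqrt(C_1 T beta_T gamma_T) with C_1 = 8 / log(1 + sigma^{-2})."""
    c1 = 8.0 / np.log1p(1.0 / noise_var)
    return np.sqrt(c1 * T * beta_finite(T, num_points, delta) * gamma_T)

horizons = np.array([1e2, 1e3, 1e4])
gamma_standin = np.log(horizons) ** 2        # ASSUMPTION: order-of-growth placeholder for gamma_T
bounds = regret_bound(horizons, gamma_standin, num_points=1000, delta=0.1, noise_var=0.025)
average_regret_bounds = bounds / horizons    # bound on R_T / T, shrinking as T grows
```

Under this stand-in the bound on the average regret decreases with $T$, which is the no-regret property; the absolute values are loose because the constants in the theorem are not optimized.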

This theorem shows that, with high probability over
samples from the GP, the cumulative regret is bounded
in terms of the maximum information gain, forging a
novel connection between GP optimization and experimental
design. This link is of fundamental technical
importance, allowing us to generalize Theorem 1 to
infinite decision spaces. Moreover, the submodularity
of $I(\mathbf{y}_A; \mathbf{f}_A)$ allows us to derive sharp a priori bounds,
depending on choice and parameterization of $k$ (see
Section 5). In the following theorem, we generalize
our result to any compact and convex $D \subset \mathbb{R}^d$ under
mild assumptions on the kernel function $k$.

Theorem 2 Let $D \subset [0, r]^d$ be compact and convex,
$d \in \mathbb{N}$, $r > 0$. Suppose that the kernel $k(x, x')$ satisfies
the following high probability bound on the derivatives
of GP sample paths $f$: for some constants $a, b > 0$,

$$\Pr\left\{ \sup_{x \in D} |\partial f / \partial x_j| > L \right\} \le a e^{-(L/b)^2}, \quad j = 1, \dots, d.$$

Pick $\delta \in (0, 1)$, and define

$$\beta_t = 2 \log(t^2 2 \pi^2 / (3\delta)) + 2 d \log\left( t^2 d b r \sqrt{\log(4 d a / \delta)} \right).$$

Running the GP-UCB with $\beta_t$ for a sample $f$ of a
GP with mean function zero and covariance function
$k(x, x')$, we obtain a regret bound of $O^*(\sqrt{d T \gamma_T})$ with
high probability. Precisely, with $C_1 = 8 / \log(1 + \sigma^{-2})$
we have

$$\Pr\left\{ R_T \le \sqrt{C_1 T \beta_T \gamma_T} + 2 \quad \forall T \ge 1 \right\} \ge 1 - \delta.$$

The main challenge in our proof (provided in the Appendix)
is to lift the regret bound in terms of the
confidence ellipsoid to general $D$. The smoothness
assumption on $k(x, x')$ disqualifies GPs with highly
erratic sample paths. It holds for stationary kernels
$k(x, x') = k(x - x')$ which are four times differentiable
(Theorem 5 of Ghosal & Roy (2006)), such as the
Squared Exponential and Matérn kernels with $\nu > 2$
(see Section 5.2), while it is violated for the Ornstein-Uhlenbeck
kernel (Matérn with $\nu = 1/2$; a stationary
variant of the Wiener process). For the latter, sample
paths $f$ are nondifferentiable almost everywhere
with probability one and come with independent increments.
We conjecture that a result of the form of
Theorem 2 does not hold in this case.

Bounds for Arbitrary $f$ in the RKHS. Thus far,
we have assumed that the target function $f$ is sampled
from a GP prior and that the noise is $N(0, \sigma^2)$ with
known variance $\sigma^2$. We now analyze GP-UCB in an
agnostic setting, where $f$ is an arbitrary function
from the RKHS corresponding to kernel $k(x, x')$.
Moreover, we allow the noise variables $\epsilon_t$ to be an arbitrary
martingale difference sequence (meaning that
$\mathbb{E}[\epsilon_t \mid \epsilon_{<t}] = 0$ for all $t \in \mathbb{N}$), uniformly bounded by $\sigma$.
Note that we still run the same GP-UCB algorithm,
whose prior and noise model are misspecified in this
case. Our following result shows that GP-UCB attains
sublinear regret even in the agnostic setting.

Theorem 3 Let $\delta \in (0, 1)$. Assume that the true
underlying $f$ lies in the RKHS $\mathcal{H}_k(D)$ corresponding
to the kernel $k(x, x')$, and that the noise $\epsilon_t$ has zero
mean conditioned on the history and is bounded by $\sigma$
almost surely. In particular, assume $\|f\|_k^2 \le B$ and
let $\beta_t = 2B + 300 \gamma_t \log^3(t / \delta)$. Running GP-UCB with
$\beta_t$, prior $GP(0, k(x, x'))$ and noise model $N(0, \sigma^2)$,
we obtain a regret bound of $O^*(\sqrt{T} (B \sqrt{\gamma_T} + \gamma_T))$ with
high probability (over the noise). Precisely,

$$\Pr\left\{ R_T \le \sqrt{C_1 T \beta_T \gamma_T} \quad \forall T \ge 1 \right\} \ge 1 - \delta,$$

where $C_1 = 8 / \log(1 + \sigma^{-2})$.

Note that while our theorem implicitly assumes that
GP-UCB has knowledge of an upper bound on $\|f\|_k$,
standard guess-and-doubling approaches suffice if no
such bound is known a priori. Comparing Theorem 2
and Theorem 3, the latter holds uniformly over all
functions $f$ with $\|f\|_k < \infty$, while the former is a probabilistic
statement requiring knowledge of the GP that
$f$ is sampled from. In contrast, if $f \sim GP(0, k(x, x'))$,
then $\|f\|_k = \infty$ almost surely (Wahba, 1990): sample
paths are rougher than RKHS functions. Neither
Theorem 2 nor 3 encompasses the other.

5. Bounding the Information Gain 

Since the bounds developed in Section 4 depend on the 
information gain, the key remaining question is how to 
bound the quantity for practical classes of kernels. 

5.1. Submodularity and Greedy Maximization 

In order to bound $\gamma_T$, we have to maximize the information
gain $F(A) = I(\mathbf{y}_A; f)$ over all subsets $A \subset D$ of
size $T$: a combinatorial problem in general. However,
as noted in Section 2, $F(A)$ is a submodular function,
which implies the performance guarantee (5) for maximizing
$F$ sequentially by the greedy ED rule (4). Dividing
both sides of (5) by $1 - 1/e$, we can upper-bound
$\gamma_T$ by $(1 - 1/e)^{-1} I(\mathbf{y}_{A_T}; f)$, where $A_T$ is constructed
by the greedy procedure. Thus, somewhat counterintuitively,
instead of using submodularity to prove that
$F(A_T)$ is near-optimal, we use it in order to show that
$\gamma_T$ is "near-greedy". As noted in Section 2, the ED
rule does not depend on observations $y_t$ and can be
run without evaluating $f$.
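This greedy bound is easy to evaluate numerically on a finite decision set. The following sketch accumulates the marginal gains $\frac{1}{2} \log(1 + \sigma^{-2} \sigma_{t-1}^2(x_t))$ of the greedy ED rule and divides by $1 - 1/e$. The function name, the grid, and the two kernel matrices being compared are illustrative assumptions; the posterior covariance is maintained with the standard rank-one Gaussian conditioning update.

```python
import numpy as np

def greedy_gamma_bound(K, T, noise_var):
    """Upper bound on gamma_T: (1 - 1/e)^{-1} F(A_T), with A_T built by the greedy ED rule."""
    cov = K.copy()                    # current posterior covariance on D
    var = np.diag(cov).copy()         # current posterior variances
    gain = 0.0
    for _ in range(T):
        j = int(np.argmax(var))                        # ED rule (4)
        gain += 0.5 * np.log1p(var[j] / noise_var)     # marginal information gain of x_j
        # Rank-one conditioning on a noisy observation at x_j
        cov = cov - np.outer(cov[:, j], cov[:, j]) / (var[j] + noise_var)
        var = np.maximum(np.diag(cov), 0.0)
    return gain / (1.0 - np.exp(-1.0))

D = np.linspace(0.0, 1.0, 100)
K_se = np.exp(-(D[:, None] - D[None, :])**2 / (2.0 * 0.2**2))   # smooth, correlated kernel
K_diag = np.eye(len(D))                                         # independent arms

bound_se = greedy_gamma_bound(K_se, T=20, noise_var=0.025)
bound_diag = greedy_gamma_bound(K_diag, T=20, noise_var=0.025)
```

For independent arms every new sample is as informative as the first, so the bound grows linearly in $T$; for the correlated kernel the marginal gains shrink quickly, which is the behaviour the regret bounds exploit.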

The importance of this greedy bound is twofold.
First, it allows us to numerically compute highly
problem-specific bounds on $\gamma_T$, which can be plugged
into our results in Section 4 to obtain high-probability
bounds on $R_T$. This being a laborious procedure, one
would prefer a priori bounds for $\gamma_T$ in practice which
are simple analytical expressions of $T$ and parameters
of $k$. In this section, we sketch a general procedure
for obtaining such expressions, instantiating them for
a number of commonly used covariance functions,
once more relying crucially on the greedy ED rule
upper bound. Suppose that $D$ is finite for now, and
let $\mathbf{f} = [f(x)]_{x \in D}$, $\mathbf{K}_D = [k(x, x')]_{x, x' \in D}$. Sampling
$f$ at $x_t$, we obtain $y_t \sim N(\mathbf{v}_t^\top \mathbf{f}, \sigma^2)$, where $\mathbf{v}_t \in \mathbb{R}^{|D|}$
is the indicator vector associated with $x_t$. We can
upper-bound the greedy maximum once more, by
relaxing this constraint to $\|\mathbf{v}_t\| = 1$ in round $t$ of the
sequential method. For this relaxed greedy procedure,
all $\mathbf{v}_t$ are leading eigenvectors of $\mathbf{K}_D$, since successive
covariance matrices of $P(\mathbf{f} \mid \mathbf{y}_{t-1})$ share their eigenbasis
with $\mathbf{K}_D$, while eigenvalues are damped according
to how many times the corresponding eigenvector is
selected. We can upper-bound the information gain
by considering the worst-case allocation of $T$ samples
to the $\min\{T, |D|\}$ leading eigenvectors of $\mathbf{K}_D$:

$$\gamma_T \le \frac{1/2}{1 - e^{-1}} \max_{(m_t)} \sum_{t=1}^{|D|} \log(1 + \sigma^{-2} m_t \lambda_t), \tag{8}$$

subject to $\sum_t m_t = T$, and $\mathrm{spec}(\mathbf{K}_D) = \{\lambda_1 \ge \lambda_2 \ge \dots\}$.
We can split the sum into two parts in order
to obtain a bound to leading order. The following
Theorem captures this intuition:

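Theorem 4 For any $T \in \mathbb{N}$ and any $T_* = 1, \dots, T$:

$$\gamma_T \le O\left( \sigma^{-2} [B(T_*) T + T_* (\log n_T T)] \right),$$

where $n_T = \sum_{t=1}^{|D|} \lambda_t$ and $B(T_*) = \sum_{t=T_*+1}^{|D|} \lambda_t$.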

Therefore, if for some $T_* = o(T)$ the first $T_*$ eigenvalues
carry most of the total mass $n_T$, the information
gain will be small. The more rapidly the spectrum
of $\mathbf{K}_D$ decays, the slower the growth of $\gamma_T$. Figure 3
illustrates this intuition.

5.2. Bounds for Common Kernels 

In this section we bound $\gamma_T$ for a range of commonly
used covariance functions: finite dimensional linear,
Squared Exponential and Matérn kernels. Together
with our results in Section 4, these imply sublinear
regret bounds for GP-UCB in all cases.

Figure 3. Spectral decay (left) and information gain bound (right) for independent (diagonal), linear, squared exponential
and Matérn kernels ($\nu = 2.5$) with equal trace.



Finite dimensional linear kernels have the form
$k(x, x') = x^\top x'$. GPs with this kernel correspond to
random linear functions $f(x) = w^\top x$, $w \sim N(0, \mathbf{I})$.

The Squared Exponential kernel is $k(x, x') = \exp(-(2 l^2)^{-1} \|x - x'\|^2)$,
$l$ a lengthscale parameter.
Sample functions are differentiable to any order
almost surely (Rasmussen & Williams, 2006).

The Matérn kernel is given by
$k(x, x') = (2^{1-\nu} / \Gamma(\nu)) r^\nu B_\nu(r)$, $r = (\sqrt{2\nu} / l) \|x - x'\|$, where $\nu$
controls the smoothness of sample paths (the smaller,
the rougher) and $B_\nu$ is a modified Bessel function.
Note that as $\nu \to \infty$, appropriately rescaled Matérn
kernels converge to the Squared Exponential kernel.

Figure 4 shows random functions drawn from GP dis- 
tributions with the above kernels. 
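The three kernel families, and the spectral decay that drives the bounds of Section 5.1, can be sketched numerically. The code below is an illustration under stated assumptions: the function names are hypothetical, the Matérn kernel is implemented only for $\nu = 5/2$ (where the Bessel form reduces to the well-known closed expression $(1 + s + s^2/3) e^{-s}$ with $s = \sqrt{5}\,\|x - x'\| / l$), and the grid and lengthscale are arbitrary choices.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distances between all rows of X."""
    sq = np.sum(X**2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))

def linear_kernel(X):
    return X @ X.T

def se_kernel(X, lengthscale):
    return np.exp(-pairwise_dist(X)**2 / (2.0 * lengthscale**2))

def matern52_kernel(X, lengthscale):
    """Matern kernel with nu = 5/2 in closed form."""
    s = np.sqrt(5.0) * pairwise_dist(X) / lengthscale
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

def tail_mass(K, t_star):
    """Fraction of the total eigenvalue mass lying beyond the t_star leading eigenvalues."""
    lam = np.sort(np.linalg.eigvalsh(K))[::-1]
    lam = np.maximum(lam, 0.0)        # remove tiny negative rounding errors
    return lam[t_star:].sum() / lam.sum()

X = np.linspace(0.0, 1.0, 200)[:, None]
tail_linear = tail_mass(linear_kernel(X), 1)            # rank at most d = 1
tail_se = tail_mass(se_kernel(X, 0.2), 10)
tail_matern = tail_mass(matern52_kernel(X, 0.2), 10)
tail_diag = tail_mass(np.eye(len(X)), 10)               # independent arms: no decay
```

The ordering of the tail masses (linear and Squared Exponential smallest, then Matérn, then the diagonal kernel) matches the ordering of the information gain bounds discussed in this section.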

Theorem 5 Let $D \subset \mathbb{R}^d$ be compact and convex, $d \in \mathbb{N}$.
Assume the kernel function satisfies $k(x, x') \le 1$.

1. Finite spectrum. For the $d$-dimensional Bayesian
linear regression case: $\gamma_T = O(d \log T)$.

2. Exponential spectral decay. For the Squared
Exponential kernel: $\gamma_T = O((\log T)^{d+1})$.

3. Power law spectral decay. For Matérn kernels
with $\nu > 1$: $\gamma_T = O(T^{d(d+1)/(2\nu + d(d+1))} (\log T))$.

A proof of Theorem 5 is given in the Appendix; we
only sketch the idea here. $\gamma_T$ is bounded by Theorem 4
in terms of the eigendecay of the kernel matrix
$\mathbf{K}_D$. If $D$ is infinite or very large, we can use the
operator spectrum of $k(x, x')$, which likewise decays
rapidly. For the kernels of interest here, asymptotic
expressions for the operator eigenvalues are given
in Seeger et al. (2008), who derived bounds on the
information gain for fixed and random designs (in
contrast to the worst-case information gain considered
here, which is substantially more challenging to
bound). The main challenge in the proof is to ensure
the existence of discretizations $D_T \subset D$, dense in the
limit, for which tail sums $B(T_*) / n_T$ in Theorem 4 are
close to corresponding operator spectra tail sums.

Together with Theorems 2 and 3, this result guarantees
sublinear regret of GP-UCB for any dimension
(see Figure 1). For the Squared Exponential kernel,
the dimension $d$ appears as exponent of $\log T$ only, so
that the regret grows at most as $O^*(\sqrt{T} (\log T)^{\frac{d+1}{2}})$:
the high degree of smoothness of the sample paths
effectively combats the curse of dimensionality.

6. Experiments 

We compare GP-UCB with heuristics such as the 
Expected Improvement (EI) and Most Probable 
Improvement (MPI), and with naive methods which 
choose points of maximum mean or variance only, 
both on synthetic and real sensor network data. 

For synthetic data, we sample random functions from a
squared exponential kernel with lengthscale parameter
0.2. The sampling noise variance $\sigma^2$ was set to 0.025 or
5% of the signal variance. Our decision set $D = [0, 1]$
is uniformly discretized into 1000 points. We run
each algorithm for $T = 1000$ iterations with $\delta = 0.1$,
averaging over 30 trials (samples from the kernel).
While the choice of $\beta_t$ as recommended by Theorem 1
leads to competitive performance of GP-UCB, we
find (using cross-validation) that the algorithm is
improved by scaling $\beta_t$ down by a factor 5. Note that
we did not optimize constants in our regret bounds.

Next, we use temperature data collected from 46 sensors
deployed at Intel Research Berkeley over 5 days at
1 minute intervals, pertaining to the example in Section 2.
We take the first two-thirds of the data set to
compute the empirical covariance of the sensor readings,
and use it as the kernel matrix. The functions $f$
for optimization consist of one set of observations from
all the sensors taken from the remaining third of the
data set, and the results (for $T = 46$, $\sigma^2 = 0.5$ or 5%
noise, $\delta = 0.1$) were averaged over 2000 possible
choices of the objective function.

Figure 5. Comparison of performance: GP-UCB and various heuristics on synthetic (a), and sensor network data (b, c).

Lastly, we take data from traffic sensors deployed along
the highway I-880 South in California. The goal was to
find the point of minimum speed in order to identify
the most congested portion of the highway; we used
traffic speed data for all working days from 6 AM to
11 AM for one month, from 357 sensors. We again
use the covariance matrix from two-thirds of the data
set as kernel matrix, and test on the other third. The
results (for $T = 357$, $\sigma^2 = 4.78$ or 5% noise, $\delta = 0.1$)
were averaged over 900 runs.

Figure 5 compares the mean average regret incurred 
by the different heuristics and the GP-UCB algorithm 
on synthetic and real data. For temperature data, 
the GP-UCB algorithm and EI heuristic clearly 
outperform the others, and do not exhibit significant 
difference between each other. On synthetic and traf- 
fic data MPI does equally well. In summary, GP-UCB 
performs at least on par with the existing approaches 
which are not equipped with regret bounds. 

7. Conclusions 

We prove the first sublinear regret bounds for GP
optimization with commonly used kernels (see Figure 1),
both for $f$ sampled from a known GP and $f$ of
low RKHS norm. We analyze GP-UCB, an intuitive,
Bayesian upper confidence bound based sampling rule.
Our regret bounds crucially depend on the information 
gain due to sampling, establishing a novel connection 
between bandit optimization and experimental design. 
We bound the information gain in terms of the kernel 
spectrum, providing a general methodology for obtain- 
ing regret bounds with kernels of interest. Our exper- 
iments on real sensor network data indicate that GP- 
UCB performs at least on par with competing criteria 
for GP optimization, for which no regret bounds are 
known at present. Our results provide an interesting 
step towards understanding exploration-exploitation 
tradeoffs with complex utility functions.

A. Regret Bounds for Target Function Sampled from GP

In this section, we provide details for the proofs of
Theorem 1 and Theorem 2. In both cases, the strategy
is to show that $|f(x) - \mu_{t-1}(x)| \le \beta_t^{1/2} \sigma_{t-1}(x)$ for all
$t \in \mathbb{N}$ and all $x \in D$, or in the infinite case, all $x$ in
a discretization of $D$ which becomes dense as $t$ gets
large.

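A.1. Finite Decision Set

We begin with the finite case, $|D| < \infty$.

Lemma 5.1 Pick $\delta \in (0, 1)$ and set $\beta_t = 2\log(|D|\pi_t/\delta)$,
where $\sum_{t \ge 1} \pi_t^{-1} = 1$, $\pi_t > 0$. Then,

$$|f(x) - \mu_{t-1}(x)| \le \beta_t^{1/2}\sigma_{t-1}(x) \quad \forall x \in D,\ \forall t \ge 1$$

holds with probability $\ge 1 - \delta$.

Proof Fix $t \ge 1$ and $x \in D$. Conditioned on
$\mathbf{y}_{t-1} = (y_1, \dots, y_{t-1})$, $\{x_1, \dots, x_{t-1}\}$ are deterministic, and
$f(x) \sim N(\mu_{t-1}(x), \sigma_{t-1}^2(x))$. Now, if $r \sim N(0, 1)$,
then

$$\Pr\{r > c\} = e^{-c^2/2}(2\pi)^{-1/2}\int_c^\infty e^{-(r-c)^2/2 - c(r-c)}\,dr \le e^{-c^2/2}\Pr\{r > 0\} = (1/2)e^{-c^2/2}$$

for $c > 0$, since $e^{-c(r-c)} \le 1$ for $r \ge c$. Therefore,
$\Pr\{|f(x) - \mu_{t-1}(x)| > \beta_t^{1/2}\sigma_{t-1}(x)\} \le e^{-\beta_t/2}$, using
$r = (f(x) - \mu_{t-1}(x))/\sigma_{t-1}(x)$ and $c = \beta_t^{1/2}$. Applying
the union bound,

$$|f(x) - \mu_{t-1}(x)| \le \beta_t^{1/2}\sigma_{t-1}(x) \quad \forall x \in D$$

holds with probability $\ge 1 - |D|e^{-\beta_t/2}$. Choosing
$|D|e^{-\beta_t/2} = \delta/\pi_t$ and using the union bound for
$t \in \mathbb{N}$, the statement holds. For example, we can use
$\pi_t = \pi^2 t^2/6$. ■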
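The two ingredients of this proof, the Gaussian tail bound and the union bound over $t$ with $\pi_t = \pi^2 t^2/6$, can be checked numerically. The following is an illustrative sketch only; the helper names and the values $|D| = 1000$, $\delta = 0.1$ are assumptions chosen here to mirror the synthetic experiment.

```python
import math

def gaussian_tail(c):
    """Exact Pr{r > c} for r ~ N(0, 1)."""
    return 0.5 * math.erfc(c / math.sqrt(2.0))

def tail_bound(c):
    """The bound (1/2) exp(-c^2 / 2) from the proof, valid for c > 0."""
    return 0.5 * math.exp(-0.5 * c * c)

def beta_finite(t, num_points, delta):
    """beta_t = 2 log(|D| pi_t / delta) with pi_t = pi^2 t^2 / 6."""
    pi_t = math.pi ** 2 * t ** 2 / 6.0
    return 2.0 * math.log(num_points * pi_t / delta)

num_points, delta = 1000, 0.1
for c in (0.1, 0.5, 1.0, 2.0, 4.0):
    print(c, gaussian_tail(c), "<=", tail_bound(c))

# Union bound: sum over t of |D| * exp(-beta_t / 2) = sum over t of delta / pi_t.
failure_mass = sum(
    num_points * math.exp(-0.5 * beta_finite(t, num_points, delta))
    for t in range(1, 10001))
print("total failure probability bound:", failure_mass, "<", delta)
```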

Lemma 5.2 Fix $t \ge 1$. If $|f(x) - \mu_{t-1}(x)| \le \beta_t^{1/2}\sigma_{t-1}(x)$
for all $x \in D$, then the regret $r_t$ is
bounded by $2\beta_t^{1/2}\sigma_{t-1}(x_t)$.

Proof By definition of $x_t$: $\mu_{t-1}(x_t) + \beta_t^{1/2}\sigma_{t-1}(x_t) \ge
\mu_{t-1}(x^*) + \beta_t^{1/2}\sigma_{t-1}(x^*) \ge f(x^*)$. Therefore,

$$r_t = f(x^*) - f(x_t) \le \beta_t^{1/2}\sigma_{t-1}(x_t) + \mu_{t-1}(x_t) - f(x_t) \le 2\beta_t^{1/2}\sigma_{t-1}(x_t).$$
■



Lemma 5.3 The information gain for the points selected
can be expressed in terms of the predictive variances.
If $\mathbf{f}_T = (f(x_t)) \in \mathbb{R}^T$:

$$I(\mathbf{y}_T; \mathbf{f}_T) = \frac{1}{2}\sum_{t=1}^T \log(1 + \sigma^{-2}\sigma_{t-1}^2(x_t)).$$

Proof Recall that $I(\mathbf{y}_T; \mathbf{f}_T) = H(\mathbf{y}_T) -
(1/2)\log|2\pi e\sigma^2 I|$. Now, $H(\mathbf{y}_T) = H(\mathbf{y}_{T-1}) +
H(y_T|\mathbf{y}_{T-1}) = H(\mathbf{y}_{T-1}) + \log(2\pi e(\sigma^2 + \sigma_{T-1}^2(x_T)))/2$.
Here, we use that $x_1, \dots, x_T$ are deterministic conditioned
on $\mathbf{y}_{T-1}$, and that the conditional variance
$\sigma_{T-1}^2(x_T)$ does not depend on $\mathbf{y}_{T-1}$. The result follows
by induction. ■
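Lemma 5.3 can be verified numerically by comparing the log-determinant form of the information gain, $\frac{1}{2}\log|I + \sigma^{-2}K|$, with the sum over sequential predictive variances. The sketch below is illustrative; the function names, the squared exponential kernel, and the 15 random inputs are assumptions made for this example.

```python
import numpy as np

def info_gain_logdet(K, noise_var):
    """I(y; f) = (1/2) log |I + sigma^-2 K| for a kernel matrix K."""
    T = K.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(T) + K / noise_var)
    return 0.5 * logdet

def info_gain_sequential(K, noise_var):
    """(1/2) sum_t log(1 + sigma^-2 sigma_{t-1}^2(x_t)), conditioning step by step."""
    T = K.shape[0]
    total = 0.0
    for t in range(T):
        if t == 0:
            var = K[0, 0]
        else:
            K_prev = K[:t, :t] + noise_var * np.eye(t)
            k_vec = K[:t, t]
            # Predictive variance of f(x_t) given the first t noisy observations.
            var = K[t, t] - k_vec @ np.linalg.solve(K_prev, k_vec)
        total += 0.5 * np.log1p(var / noise_var)
    return total

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=15)
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2)
gain_logdet = info_gain_logdet(K, 0.025)
gain_sequential = info_gain_sequential(K, 0.025)
print(gain_logdet, gain_sequential)
```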

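Lemma 5.4 Pick $\delta \in (0, 1)$ and let $\beta_t$ be defined as in
Lemma 5.1. Then, the following holds with probability
$\ge 1 - \delta$:

$$\sum_{t=1}^T r_t^2 \le \beta_T C_1 I(\mathbf{y}_T; \mathbf{f}_T) \le C_1\beta_T\gamma_T \quad \forall T \ge 1,$$

where $C_1 := 8/\log(1 + \sigma^{-2}) \ge 8\sigma^2$.

Proof By Lemma 5.1 and Lemma 5.2, we have that
$\{r_t^2 \le 4\beta_t\sigma_{t-1}^2(x_t)\ \forall t \ge 1\}$ with probability $\ge 1 - \delta$.
Now, $\beta_t$ is nondecreasing, so that

$$4\beta_t\sigma_{t-1}^2(x_t) \le 4\beta_T\sigma^2(\sigma^{-2}\sigma_{t-1}^2(x_t)) \le 4\beta_T\sigma^2 C_2\log(1 + \sigma^{-2}\sigma_{t-1}^2(x_t))$$

with $C_2 = \sigma^{-2}/\log(1 + \sigma^{-2}) \ge 1$, since
$s^2 \le C_2\log(1 + s^2)$ for $s^2 \in [0, \sigma^{-2}]$, and
$\sigma^{-2}\sigma_{t-1}^2(x_t) \le \sigma^{-2}k(x_t, x_t) \le \sigma^{-2}$. Noting that
$C_1 = 8\sigma^2 C_2$, the result follows by plugging in the
representation of Lemma 5.3. ■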

Finally, Theorem 1 is a simple consequence of
Lemma 5.4, since $R_T^2 \le T\sum_{t=1}^T r_t^2$ by the
Cauchy-Schwarz inequality.
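For completeness, the Cauchy-Schwarz step can be spelled out as a short worked derivation. It uses only the quantities of Lemma 5.4 and assumes the cumulative regret is defined as $R_T = \sum_{t=1}^T r_t$, as in the main text.

```latex
\begin{align*}
R_T^2 = \Big(\sum_{t=1}^T 1 \cdot r_t\Big)^2
  &\le \Big(\sum_{t=1}^T 1^2\Big)\Big(\sum_{t=1}^T r_t^2\Big)
   = T \sum_{t=1}^T r_t^2 \\
  &\le C_1 T \beta_T \gamma_T
  \quad\text{(Lemma 5.4, with probability } \ge 1 - \delta\text{)}, \\
\text{so that}\quad R_T &\le \sqrt{C_1 T \beta_T \gamma_T}
  \quad \forall T \ge 1 .
\end{align*}
```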

A.2. General Decision Set

Theorem 2 extends the statement of Theorem 1 to
the general case of $D \subset \mathbb{R}^d$ compact. We cannot
expect this generalization to work without any assumptions
on the kernel $k(x, x')$. For example, if
$k(x, x') = e^{-\|x - x'\|}$ (Ornstein-Uhlenbeck), while sample
paths $f$ are a.s. continuous, they are still very erratic:
$f$ is a.s. nondifferentiable almost everywhere,
and the process comes with independent increments, a
stationary variant of Brownian motion. The additional
assumption on $k$ in Theorem 2 is rather mild and is
satisfied by several common kernels, as discussed in
Section 4.

Recall that the finite case proof is based on Lemma 5.1 
paving the way for Lemma 5.2. However, Lemma 5.1 
does not hold for infinite D. First, let us observe that 
we have confidence on all decisions actually chosen. 

Lemma 5.5 Pick $\delta \in (0, 1)$ and set $\beta_t = 2\log(\pi_t/\delta)$,
where $\sum_{t \ge 1} \pi_t^{-1} = 1$, $\pi_t > 0$. Then,

$$|f(x_t) - \mu_{t-1}(x_t)| \le \beta_t^{1/2}\sigma_{t-1}(x_t) \quad \forall t \ge 1$$

holds with probability $\ge 1 - \delta$.

Proof Fix $t \ge 1$ and $x \in D$. Conditioned on
$\mathbf{y}_{t-1} = (y_1, \dots, y_{t-1})$, $\{x_1, \dots, x_{t-1}\}$ are deterministic,
and $f(x) \sim N(\mu_{t-1}(x), \sigma_{t-1}^2(x))$. As before,
$\Pr\{|f(x_t) - \mu_{t-1}(x_t)| > \beta_t^{1/2}\sigma_{t-1}(x_t)\} \le e^{-\beta_t/2}$.
Since $e^{-\beta_t/2} = \delta/\pi_t$ and using the union bound for
$t \in \mathbb{N}$, the statement holds. ■

Purely for the sake of analysis, we use a set of
discretizations $D_t \subset D$, where $D_t$ will be used at time
$t$ in the analysis. Essentially, we use this to obtain a
valid confidence interval on $x^*$. The following lemma
provides a confidence bound for these subsets.

Lemma 5.6 Pick $\delta \in (0, 1)$ and set $\beta_t = 2\log(|D_t|\pi_t/\delta)$,
where $\sum_{t \ge 1} \pi_t^{-1} = 1$, $\pi_t > 0$. Then,

$$|f(x) - \mu_{t-1}(x)| \le \beta_t^{1/2}\sigma_{t-1}(x) \quad \forall x \in D_t,\ \forall t \ge 1$$

holds with probability $\ge 1 - \delta$.

Proof The proof is identical to that in Lemma 5.1,
except now we use $D_t$ at each timestep. ■

Now by assumption and the union bound, we have that

$$\Pr\{\forall j,\ \forall x \in D,\ |\partial f/\partial x_j| < L\} \ge 1 - dae^{-L^2/b^2},$$

which implies that, with probability greater than
$1 - dae^{-L^2/b^2}$, we have that

$$\forall x, x' \in D,\quad |f(x) - f(x')| \le L\|x - x'\|_1. \quad (9)$$

This allows us to obtain confidence on $x^*$ as follows.

Now let us choose a discretization $D_t$ of size $(\tau_t)^d$ so
that for all $x \in D_t$

$$\|x - [x]_t\|_1 \le rd/\tau_t,$$

where $[x]_t$ denotes the closest point in $D_t$ to $x$. A
sufficient discretization has each coordinate with $\tau_t$
uniformly spaced points.
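A uniform grid of this kind is easy to construct and check. The sketch below assumes, for illustration, that the decision set is the cube $[0, r]^d$ and places the $\tau$ points per coordinate at cell midpoints; the helper names `uniform_grid` and `nearest_l1` are hypothetical. With this placement, the $L_1$ distance from any point of the cube to its nearest grid point is at most $rd/(2\tau)$, which is within the required $rd/\tau$.

```python
import itertools
import numpy as np

def uniform_grid(r, d, tau):
    """tau uniformly spaced points per coordinate in [0, r]^d (tau**d points)."""
    axis = (np.arange(tau) + 0.5) * r / tau  # midpoints of tau equal cells
    return np.array(list(itertools.product(axis, repeat=d)))

def nearest_l1(x, grid):
    """Closest grid point to x in L1 distance."""
    return grid[np.argmin(np.abs(grid - x).sum(axis=1))]

r, d, tau = 1.0, 2, 8
grid = uniform_grid(r, d, tau)
rng = np.random.default_rng(0)
points = rng.uniform(0.0, r, size=(500, d))
worst = max(np.abs(p - nearest_l1(p, grid)).sum() for p in points)
print("grid size:", len(grid), "worst L1 distance:", worst, "<=", r * d / tau)
```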

Lemma 5.7 Pick $\delta \in (0, 1)$ and set $\beta_t =
2\log(2\pi_t/\delta) + 4d\log(dtbr\sqrt{\log(2da/\delta)})$, where
$\sum_{t \ge 1} \pi_t^{-1} = 1$, $\pi_t > 0$. Let $\tau_t = dt^2br\sqrt{\log(2da/\delta)}$.
Let $[x^*]_t$ denote the closest point in $D_t$ to $x^*$.
Then,

$$|f(x^*) - \mu_{t-1}([x^*]_t)| \le \beta_t^{1/2}\sigma_{t-1}([x^*]_t) + \frac{1}{t^2} \quad \forall t \ge 1$$

holds with probability $\ge 1 - \delta$.

Proof Using (9), we have that with probability
greater than $1 - \delta/2$,

$$\forall x, x' \in D,\quad |f(x) - f(x')| \le b\sqrt{\log(2da/\delta)}\,\|x - x'\|_1.$$

Hence,

$$\forall x \in D_t,\quad |f(x) - f([x]_t)| \le rdb\sqrt{\log(2da/\delta)}/\tau_t.$$

Now by choosing $\tau_t = dt^2br\sqrt{\log(2da/\delta)}$, we have that

$$\forall x \in D_t,\quad |f(x) - f([x]_t)| \le \frac{1}{t^2}.$$

This implies that $|D_t| = (dt^2br\sqrt{\log(2da/\delta)})^d$. Using
$\delta/2$ in Lemma 5.6, we can apply the confidence bound
to $[x^*]_t$ (as this lives in $D_t$) to obtain the result. ■

Now we are able to bound the regret. 

Lemma 5.8 Pick $\delta \in (0, 1)$ and set $\beta_t =
2\log(4\pi_t/\delta) + 4d\log(dtbr\sqrt{\log(4da/\delta)})$, where
$\sum_{t \ge 1} \pi_t^{-1} = 1$, $\pi_t > 0$. Then, with probability greater
than $1 - \delta$, for all $t \in \mathbb{N}$, the regret is bounded as
follows:

$$r_t \le 2\beta_t^{1/2}\sigma_{t-1}(x_t) + \frac{1}{t^2}.$$

Proof We use $\delta/2$ in both Lemma 5.5 and Lemma 5.7,
so that these events hold with probability greater
than $1 - \delta$. Note that the specification of $\beta_t$ in the
above lemma is greater than the specification used in
Lemma 5.5 (with $\delta/2$), so this choice is valid.

By definition of $x_t$: $\mu_{t-1}(x_t) + \beta_t^{1/2}\sigma_{t-1}(x_t) \ge
\mu_{t-1}([x^*]_t) + \beta_t^{1/2}\sigma_{t-1}([x^*]_t)$. Also, by Lemma 5.7, we
have that $\mu_{t-1}([x^*]_t) + \beta_t^{1/2}\sigma_{t-1}([x^*]_t) + 1/t^2 \ge f(x^*)$,
which implies $\mu_{t-1}(x_t) + \beta_t^{1/2}\sigma_{t-1}(x_t) \ge f(x^*) - 1/t^2$.
Therefore,

$$r_t = f(x^*) - f(x_t) \le \beta_t^{1/2}\sigma_{t-1}(x_t) + 1/t^2 + \mu_{t-1}(x_t) - f(x_t) \le 2\beta_t^{1/2}\sigma_{t-1}(x_t) + 1/t^2,$$

which completes the proof. ■

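Now we are ready to complete the proof of Theorem 2.
As shown in the proof of Lemma 5.4, we have that with
probability greater than $1 - \delta$,

$$\sum_{t=1}^T 4\beta_t\sigma_{t-1}^2(x_t) \le C_1\beta_T\gamma_T \quad \forall T \ge 1,$$

so that by Cauchy-Schwarz:

$$\sum_{t=1}^T 2\beta_t^{1/2}\sigma_{t-1}(x_t) \le \sqrt{C_1T\beta_T\gamma_T} \quad \forall T \ge 1.$$

Hence,

$$\sum_{t=1}^T r_t \le \sqrt{C_1T\beta_T\gamma_T} + \pi^2/6 \quad \forall T \ge 1$$

(since $\sum_{t \ge 1} 1/t^2 = \pi^2/6$). Theorem 2 now follows.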

Finally, we now discuss the additional assumption on
$k$ in Theorem 2. For samples $f$ of the GP, consider
partial derivatives $\partial f/\partial x_j$ of this sample path for
$j = 1, \dots, d$. Theorem 5 of Ghosal & Roy (2006)
states that if derivatives up to fourth order exist
for $(x, x') \mapsto k(x, x')$, then $f$ is almost surely
continuously differentiable, with $\partial f/\partial x_j$ distributed as
Gaussian processes again. Moreover, there are constants
$a, b_j > 0$ such that

$$\Pr\Big\{\sup_{x \in D} |\partial f/\partial x_j| > L\Big\} \le ae^{-b_jL^2}. \quad (10)$$

Picking $L = [\log(2da/\delta)/\min_j b_j]^{1/2}$, we have that
$ae^{-b_jL^2} \le \delta/(2d)$ for all $j = 1, \dots, d$, so that for
$K_1 = d^{1/2}L$, by the mean value theorem, we have

$$\Pr\{|f(x) - f(x')| \le K_1\|x - x'\|\ \ \forall x, x' \in D\} \ge 1 - \delta/2.$$

Also, note that $K_1 = O((\log\delta^{-1})^{1/2})$.

This statement is about the joint distribution of $f(\cdot)$
and its partial derivatives w.r.t. each component. For
a certain event in this sample space, all $\partial f/\partial x_j$ exist,
are continuous, and the complement of (10) holds
for all $j$. Theorem 5 of Ghosal & Roy (2006), together
with the union bound, implies that this event has
probability $\ge 1 - \delta/2$. Derivatives up to fourth order exist
for the Gaussian covariance function, and for Matérn
kernels with $\nu > 2$ (Stein, 1999).

B. Regret Bound for Target Function in RKHS

In this section, we detail a proof of Theorem 3. Recall
that in this setting, we do not know the generator of
the target function $f$, but only a bound on its RKHS
norm $\|f\|_k$.

Recall the posterior mean function $\mu_T(\cdot)$ and posterior
covariance function $k_T(\cdot, \cdot)$ from Section 2, conditioned
on data $(x_t, y_t)$, $t = 1, \dots, T$. It is easy to see that the
RKHS norm corresponding to $k_T$ is given by

$$\|f\|_{k_T}^2 = \|f\|_k^2 + \sigma^{-2}\sum_{t=1}^T f(x_t)^2.$$

This implies that $H_k(D) = H_{k_T}(D)$ for any $T$, while
the RKHS inner products are different: $\|f\|_{k_T} \ge \|f\|_k$.
Since $\langle f(\cdot), k_T(\cdot, x)\rangle_{k_T} = f(x)$ for any $f \in H_{k_T}(D)$ by
the reproducing property, then

$$|\mu_T(x) - f(x)| \le k_T(x, x)^{1/2}\|\mu_T - f\|_{k_T} = \sigma_T(x)\|\mu_T - f\|_{k_T} \quad (11)$$

by the Cauchy-Schwarz inequality.

Compared to our other results, Theorem 3 is an agnostic
statement, in that the assumptions the Bayesian
UCB algorithm bases its predictions on differ from
how $f$ and data $y_t$ are generated. First, $f$ is not
drawn from a GP, but can be an arbitrary function
from $H_k(D)$. Second, while the UCB method assumes
that the noise $\varepsilon_t = y_t - f(x_t)$ is drawn independently
from $N(0, \sigma^2)$, the true sequence of noise variables $\varepsilon_t$
can be a uniformly bounded martingale difference
sequence: $\varepsilon_t \le \sigma$ for all $t \in \mathbb{N}$. All we have to do in order
to lift the proof of Theorem 1 to the agnostic setting
is to establish an analogue to Lemma 5.1, by way of
the following concentration result.

Theorem 6 Let $\delta \in (0, 1)$. Assume the noise variables
$\varepsilon_t$ are uniformly bounded by $\sigma$. Define:

$$\beta_t = 2\|f\|_k^2 + 300\gamma_t\ln^3(t/\delta).$$

Then

$$\Pr\{\forall T,\ \forall x \in D,\ |\mu_T(x) - f(x)| \le \beta_{T+1}^{1/2}\sigma_T(x)\} \ge 1 - \delta.$$

B.1. Concentration of Martingales

In our analysis, we use the following Bernstein-type
concentration inequality for martingale differences,
due to Freedman (1975) (see also Theorem 3.15 of
McDiarmid 1998).

Theorem 7 (Freedman) Suppose $X_1, \dots, X_T$ is a
martingale difference sequence, and $b$ is a uniform
upper bound on the steps $X_i$. Let $V$ denote the sum of
conditional variances,

$$V = \sum_{i=1}^T \mathrm{Var}(X_i \mid X_1, \dots, X_{i-1}).$$

Then, for every $a, v > 0$,

$$\Pr\Big\{\sum_{i=1}^T X_i \ge a \text{ and } V \le v\Big\} \le \exp\Big(\frac{-a^2}{2v + 2ab/3}\Big).$$
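To get a feeling for Theorem 7, the bound can be compared against a simulation. The sketch below is an illustration under simplifying assumptions that are not part of the analysis: the steps are independent fair $\pm 1$ coin flips (a special case of a martingale difference sequence with $b = 1$), so the sum of conditional variances equals $T$ exactly and the event $V \le v$ with $v = T$ always holds. The function name `freedman_bound` is hypothetical.

```python
import math
import numpy as np

def freedman_bound(a, v, b):
    """Right-hand side of Theorem 7: exp(-a^2 / (2v + 2ab/3))."""
    return math.exp(-a * a / (2.0 * v + 2.0 * a * b / 3.0))

rng = np.random.default_rng(0)
T, a = 100, 25.0
# Independent +/-1 steps: |X_i| <= b = 1 and Var(X_i | past) = 1, so V = T.
steps = rng.choice([-1.0, 1.0], size=(20000, T))
empirical = np.mean(steps.sum(axis=1) >= a)
bound = freedman_bound(a, v=float(T), b=1.0)
print("empirical tail:", empirical, "Freedman bound:", bound)
```

The benefit over Hoeffding-Azuma appears when the conditional variances are much smaller than $b^2$, which is the situation exploited in Section B.3.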

B.2. Proof of Theorem 6 

We will show that:

$$\Pr\{\forall T,\ \|\mu_T - f\|_{k_T}^2 \le \beta_{T+1}\} \ge 1 - \delta.$$

Theorem 6 then follows from (11). Recall that $\varepsilon_t =
y_t - f(x_t)$. We will analyze the quantity $Z_T =
\|\mu_T - f\|_{k_T}^2$, measuring the error of $\mu_T$ as approximation
to $f$ under the RKHS norm of $H_{k_T}(D)$. The
following lemma provides the connection with the
information gain. This lemma is important since our
concentration argument is an inductive argument:
roughly speaking, we condition on getting concentration
in the past, in order to achieve good concentration
in the future.

Lemma 7.1 We have that

$$\sum_{t=1}^T \min\{\sigma^{-2}\sigma_{t-1}^2(x_t), \alpha\} \le \frac{2\alpha}{\log(1 + \alpha)}\gamma_T, \quad \alpha > 0.$$

Proof We have that $\min\{r, \alpha\} \le (\alpha/\log(1 + \alpha))\log(1 + r)$.
The statement follows from Lemma 5.3. ■



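The next lemma bounds the growth of $Z_T$. It is formulated
in terms of normalized quantities: $\tilde\varepsilon_t = \varepsilon_t/\sigma$,
$\tilde f = f/\sigma$, $\tilde\mu_t = \mu_t/\sigma$, $\tilde\sigma_t = \sigma_t/\sigma$. Also, to ease notation,
we will use $\mu_{t-1}$, $\sigma_{t-1}$ as shorthand for $\mu_{t-1}(x_t)$,
$\sigma_{t-1}(x_t)$.

Lemma 7.2 For all $T \in \mathbb{N}$,

$$Z_T \le \|f\|_k^2 + 2\sum_{t=1}^T \tilde\varepsilon_t\frac{\tilde\mu_{t-1} - \tilde f(x_t)}{1 + \tilde\sigma_{t-1}^2} + \sum_{t=1}^T \tilde\varepsilon_t^2\frac{\tilde\sigma_{t-1}^2}{1 + \tilde\sigma_{t-1}^2}.$$

Proof If $\alpha_T = (K_T + \sigma^2I)^{-1}\mathbf{y}_T$, then $\mu_T(x) =
\alpha_T^\top k_T(x)$. Then, $\langle\mu_T, f\rangle_k = \mathbf{f}_T^\top\alpha_T$, $\|\mu_T\|_k^2 =
\mathbf{y}_T^\top\alpha_T - \sigma^2\|\alpha_T\|^2$. Moreover, for $t \le T$, $\mu_T(x_t) =
\delta_t^\top K_T(K_T + \sigma^2I)^{-1}\mathbf{y}_T = y_t - \sigma^2\alpha_t$, where $\alpha_t$ is the
$t$-th component of $\alpha_T$. Since $Z_T =
\|\mu_T - f\|_k^2 + \sigma^{-2}\sum_{t \le T}(\mu_T(x_t) - f(x_t))^2$, we have that

$$Z_T = \|f\|_k^2 - 2\mathbf{f}_T^\top\alpha_T + \mathbf{y}_T^\top\alpha_T - \sigma^2\|\alpha_T\|^2 + \sigma^{-2}\|\varepsilon_T - \sigma^2\alpha_T\|^2$$
$$= \|f\|_k^2 - \mathbf{y}_T^\top(K_T + \sigma^2I)^{-1}\mathbf{y}_T + \sigma^{-2}\|\varepsilon_T\|^2.$$

Now, $-\mathbf{y}_T^\top(K_T + \sigma^2I)^{-1}\mathbf{y}_T \doteq 2\log P(\mathbf{y}_T)$, where $\doteq$
means that we drop determinant terms, thus concentrate
on quadratic functions. Since $\log P(\mathbf{y}_T) =
\sum_t \log P(y_t|\mathbf{y}_{<t}) = \sum_t \log N(y_t|\mu_{t-1}(x_t), \sigma_{t-1}^2(x_t) + \sigma^2)$,
we have that

$$-\mathbf{y}_T^\top(K_T + \sigma^2I)^{-1}\mathbf{y}_T = -\sum_t \frac{(y_t - \mu_{t-1})^2}{\sigma^2 + \sigma_{t-1}^2}
= 2\sum_t \varepsilon_t\frac{\mu_{t-1} - f(x_t)}{\sigma^2 + \sigma_{t-1}^2} - \sum_t \frac{\varepsilon_t^2}{\sigma^2 + \sigma_{t-1}^2} - R$$

with $R = \sum_t (\mu_{t-1} - f(x_t))^2/(\sigma^2 + \sigma_{t-1}^2) \ge 0$.
Dropping $-R$ and changing to normalized quantities
concludes the proof. ■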
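The key identity in this proof, that the quadratic form $\mathbf{y}_T^\top(K_T + \sigma^2 I)^{-1}\mathbf{y}_T$ decomposes into a sum of squared one-step prediction errors scaled by the predictive variances, can be checked numerically. The sketch below is illustrative only; the function names, the squared exponential kernel, and the random data are assumptions made for this example.

```python
import numpy as np

def quadratic_form_direct(K, y, noise_var):
    """y^T (K + sigma^2 I)^-1 y."""
    return y @ np.linalg.solve(K + noise_var * np.eye(len(y)), y)

def quadratic_form_sequential(K, y, noise_var):
    """sum_t (y_t - mu_{t-1}(x_t))^2 / (sigma^2 + sigma_{t-1}^2(x_t))."""
    total = 0.0
    for t in range(len(y)):
        if t == 0:
            mean, var = 0.0, K[0, 0]
        else:
            A = K[:t, :t] + noise_var * np.eye(t)
            k_vec = K[:t, t]
            w = np.linalg.solve(A, k_vec)
            mean = w @ y[:t]           # posterior mean mu_{t-1}(x_t)
            var = K[t, t] - w @ k_vec  # posterior variance sigma_{t-1}^2(x_t)
        total += (y[t] - mean) ** 2 / (noise_var + var)
    return total

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=12)
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2)
y = rng.normal(size=12)
direct = quadratic_form_direct(K, y, 0.025)
sequential = quadratic_form_sequential(K, y, 0.025)
print(direct, sequential)
```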



We now define a useful martingale difference sequence.
First, it is convenient to define an "escape event" $E_T$
as:

$$E_T = I\{Z_t \le \beta_{t+1} \text{ for all } t \le T\},$$

where $I\{\cdot\}$ is the indicator function. Define the random
variables $M_t$ by

$$M_t = 2\tilde\varepsilon_tE_{t-1}\frac{\tilde\mu_{t-1} - \tilde f(x_t)}{1 + \tilde\sigma_{t-1}^2}.$$

Now, since $\tilde\varepsilon_t$ is a martingale difference sequence with
respect to the histories $H_{<t}$ and $M_t/\tilde\varepsilon_t$ is deterministic
given $H_{<t}$, $M_t$ is a martingale difference sequence
as well. Next, we show that with high probability,
the associated martingale $\sum_{t=1}^T M_t$ does not grow too
large.

Lemma 7.3 Given $\delta \in (0, 1)$ and $\beta_t$ as defined in
Theorem 6, we have that

$$\Pr\Big\{\forall T,\ \sum_{t=1}^T M_t \le \beta_{T+1}/2\Big\} \ge 1 - \delta.$$

The proof is given below in Section B.3. Equipped
with this lemma, we can prove Theorem 6.

Proof [of Theorem 6] It suffices to show that the
high-probability event described in Lemma 7.3 is contained
in the support of $E_T$ for every $T$. We prove the latter
by induction on $T$.

By Lemma 7.2 and the definition of $\beta_1$, we know that
$Z_0 \le \|f\|_k^2 \le \beta_1$. Hence $E_0 = 1$ always. Now suppose
the high-probability event of Lemma 7.3 holds, in particular
$\sum_{t=1}^T M_t \le \beta_{T+1}/2$. For the inductive hypothesis,
assume $E_{T-1} = 1$. Using this and Lemma 7.2:

$$Z_T \le \|f\|_k^2 + 2\sum_{t=1}^T \tilde\varepsilon_t\frac{\tilde\mu_{t-1} - \tilde f(x_t)}{1 + \tilde\sigma_{t-1}^2} + \sum_{t=1}^T \tilde\varepsilon_t^2\frac{\tilde\sigma_{t-1}^2}{1 + \tilde\sigma_{t-1}^2}$$
$$= \|f\|_k^2 + \sum_{t=1}^T M_t + \sum_{t=1}^T \tilde\varepsilon_t^2\frac{\tilde\sigma_{t-1}^2}{1 + \tilde\sigma_{t-1}^2}$$
$$\le \|f\|_k^2 + \beta_{T+1}/2 + \sum_{t=1}^T \min\{\tilde\sigma_{t-1}^2, 1\}$$
$$\le \|f\|_k^2 + \beta_{T+1}/2 + (2/\log 2)\gamma_T \le \beta_{T+1}.$$

The equality in the second step uses the inductive
hypothesis. Thus we have shown $E_T = 1$, completing
the induction. ■



B.3. Concentration 

What remains to be shown is Lemma 7.3. While the 
step sizes $|M_t|$ are uniformly bounded, a standard
application of the Hoeffding-Azuma inequality leads to
a bound of T'^/^, too large for our purpose. We use 
the more specific Theorem 7 instead, which requires 
to control the conditional variances rather than the 
marginal variances which can be much larger. 

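Proof [of Lemma 7.3] Let us first obtain upper bounds
on the step sizes of our martingale.

$$|M_t| = 2|\tilde\varepsilon_t|E_{t-1}\frac{|\tilde\mu_{t-1} - \tilde f(x_t)|}{1 + \tilde\sigma_{t-1}^2}
\le 2|\tilde\varepsilon_t|E_{t-1}\frac{\beta_t^{1/2}\tilde\sigma_{t-1}}{1 + \tilde\sigma_{t-1}^2}
\le 2|\tilde\varepsilon_t|E_{t-1}\beta_t^{1/2}\min\{\tilde\sigma_{t-1}, 1/2\}, \quad (12)$$

where the first inequality follows from the definition
of $E_t$. Moreover, $r/(1 + r^2) \le \min\{r, 1/2\}$ for $r \ge 0$.
Therefore, $|M_t| \le \beta_T^{1/2}$ for all $t \le T$, since $|\tilde\varepsilon_t| \le 1$ and
$\beta_t$ is nondecreasing. Next, we bound the sum of the
conditional variances of the martingale:

$$V_T := \sum_{t=1}^T \mathrm{Var}(M_t \mid M_1 \dots M_{t-1})
\le 4\beta_T\sum_{t=1}^T \min\{\tilde\sigma_{t-1}^2, 1/4\}
\le 9\beta_T\gamma_T.$$

In the last line, we used Lemma 7.1 with $\alpha = 1/4$, noting
that $8\alpha/\log(1 + \alpha) < 9$. Since we have established
that the sum of conditional variances, $V_T$, is always
bounded by $9\beta_T\gamma_T$, we can apply Theorem 7 with
parameters $a = \beta_{T+1}/2$, $b = \beta_{T+1}^{1/2}$ and $v = 9\beta_T\gamma_T$ to
get

$$\Pr\Big\{\sum_{t=1}^T M_t \ge \beta_{T+1}/2\Big\}
= \Pr\Big\{\sum_{t=1}^T M_t \ge \beta_{T+1}/2 \text{ and } V_T \le 9\beta_T\gamma_T\Big\}$$
$$\le \exp\Big(\frac{-(\beta_{T+1}/2)^2}{2(9\beta_T\gamma_T) + \frac{2}{3}(\beta_{T+1}/2)\beta_{T+1}^{1/2}}\Big)
\le \exp\Big(\frac{-\beta_{T+1}}{72\gamma_T + \frac{4}{3}\beta_{T+1}^{1/2}}\Big)$$
$$\le \max\Big\{\exp\Big(\frac{-\beta_{T+1}}{144\gamma_T}\Big), \exp\Big(\frac{-3\beta_{T+1}^{1/2}}{8}\Big)\Big\}.$$

Note that our choice of $\beta_{T+1}$ satisfies:

$$\max\{144\gamma_T\log(T^2/\delta),\ ((8/3)\log(T^2/\delta))^2\} \le \beta_{T+1}.$$

Therefore, the previous probability is bounded by
$\delta/T^2$, which follows from the definition of $\beta_{T+1}$.
With a final application of the union bound over $T$:

$$\Pr\Big\{\sum_{t=1}^T M_t \ge \beta_{T+1}/2 \text{ for some } T\Big\} \le \delta,$$

completing the proof of Lemma 7.3. ■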


C. Bounds on Information Gain

In this section, we show how to bound $\gamma_T$, the maximum
information gain after $T$ rounds, for compact
$D \subset \mathbb{R}^d$ (assumptions of Theorem 2) and several
commonly used covariance functions. In this section, we
assume* that $k(x, x) = 1$ for all $x \in D$.

The plan of attack is as follows. First, we note that the
argument of $\gamma_T$, $I(\mathbf{y}_A; \mathbf{f}_A)$, is a submodular function,
so $\gamma_T$ can be bounded by the value obtained by greedy
maximization. Next, we use a discretization $D_T \subset D$
with $n_T = |D_T| = T^\tau$ with nearest neighbour distance
$o(1)$, consider the kernel matrix $K_{D_T} \in \mathbb{R}^{n_T \times n_T}$, and
bound $\gamma_T$ by an expression involving the eigenvalues
$\{\hat\lambda_t\}$ of this matrix, which is done by a further
relaxation of the greedy procedure. Finally, we bound
this empirical expression in terms of the kernel operator
eigenvalues of $k$ w.r.t. the uniform distribution on
$D$. Asymptotic expressions for the latter are reviewed
in Seeger et al. (2008), which we plug in to obtain
our results. A key step in this argument is to ensure
the existence of a discretization $D_T$, for which tails
of the empirical spectrum can be bounded by tails of
the process spectrum. We will invoke the probabilistic
method for that.

C.1. Greedy Maximization and Discretization

In this section, we fix $T \in \mathbb{N}$ and assume the existence
of a discretization $D_T \subset D$, $n_T = |D_T|$ on the order
of $T^\tau$, such that:

$$\forall x \in D\ \ \exists [x]_T \in D_T :\ \|x - [x]_T\| = O(T^{-\tau/d}). \quad (13)$$

We come back to the choice of $D_T$ below. We restrict
the information gain to subsets $A \subset D_T$:

$$\tilde\gamma_T = \max_{A \subset D_T, |A| = T} I(\mathbf{y}_A; \mathbf{f}_A).$$

Of course, $\tilde\gamma_T \le \gamma_T$, but we can bound the slack.

* Without loss of generality. We use this assumption
below to ensure that $n_T^{-1}\,\mathrm{tr}\,K_{D_T} = \int k(x, x)\,dx$. If $k(x, x)$
is not constant, this is approximately true by the law of
large numbers, and our result below remains valid.



Lemma 7.4 Under the assumptions of Theorem 2,
the information gain $F_T(\{x_t\}) = (1/2)\log|I +
\sigma^{-2}K_{\{x_t\}}|$ is uniformly Lipschitz-continuous in each
component $x_t \in D$.

Proof The assumptions of Theorem 2 imply that
the kernel $K(x, x')$ is continuously differentiable.
The result follows from the fact that $F_T(\{x_t\})$ is
continuously differentiable in the kernel matrix $K_{\{x_t\}}$. ■

Lemma 7.5 Let $D_T$ be a discretization of $D$ such that
(13) holds. Under the assumptions of Theorem 2, we
have that

$$0 \le \gamma_T - \tilde\gamma_T = O(T^{1-\tau/d}).$$

Proof Fix $T \in \mathbb{N}$, and let $A = \{x_1, \dots, x_T\}$ be a
maximizer for $\gamma_T$. Consider neighbours $[x_t]_T \in D_T$
according to (13), $[A]_T = \{[x_t]_T\}$. Then,

$$0 \le \gamma_T - \tilde\gamma_T \le \gamma_T - I(\mathbf{y}_{[A]_T}; \mathbf{f}_{[A]_T}) = F_T(A) - F_T([A]_T),$$

where $F_T(\{x_t\}) = (1/2)\log|I + \sigma^{-2}K_{\{x_t\}}|$. By
Lemma 7.4, $F_T$ is uniformly Lipschitz-continuous
in each component, so that $|\gamma_T - I(\mathbf{y}_{[A]_T}; \mathbf{f}_{[A]_T})| =
O(T\max_t\|x_t - [x_t]_T\|) = O(T^{1-\tau/d})$ by (13) and the
mean value theorem. ■



We concentrate on $\tilde\gamma_T$ in the sequel. Let $K_{D_T} =
[k(x, x')]_{x, x' \in D_T}$ be the kernel matrix over the entire
$D_T$, and $K_{D_T} = U\hat\Lambda U^\top$ its eigendecomposition,
with $\hat\lambda_1 \ge \hat\lambda_2 \ge \dots \ge 0$ and $U = [u_1\ u_2 \dots]$
orthonormal. Here, if $T > n_T$, define $\hat\lambda_t = 0$ for
$t = n_T + 1, \dots, T$. Information gain maximization
over a finite $D_T$ can be described in terms of a simple
linear-Gaussian model over the unknown $\mathbf{f} \in \mathbb{R}^{n_T}$,
with prior $P(\mathbf{f}) = N(0, K_{D_T})$ and likelihood potentials
$P(y_t|\mathbf{f}) = N(v_t^\top\mathbf{f}, \sigma^2)$ with unit-norm features,
$\|v_t\| = 1$. With the following lemma, we upper-bound
$\tilde\gamma_T$ by way of two relaxations.

Lemma 7.6 For any $T \ge 1$, we have that

$$\tilde\gamma_T \le \frac{1/2}{1 - e^{-1}}\max_{m_1, \dots, m_T}\sum_{t=1}^T \log(1 + \sigma^{-2}m_t\hat\lambda_t),$$

subject to $m_t \in \mathbb{N}$, $\sum_t m_t = T$, where $\hat\lambda_1 \ge \hat\lambda_2 \ge \dots$
is the spectrum of the kernel matrix $K_{D_T}$. Here, if
$T > n_T$, then $m_t = 0$ for $t > n_T$.

Proof As shown by Krause & Guestrin (2005),
the function $F(A) = I(\mathbf{y}_A; \mathbf{f})$ is submodular. In
the particular case considered here, this can be seen
as follows: $F(A) = H(\mathbf{y}_A) - H(\mathbf{y}_A \mid \mathbf{f})$, where
the entropy $H(\mathbf{y}_A)$ is a (not-necessarily monotonic)
submodular function in $A$, and since the noise is
conditionally independent given $\mathbf{f}$, $H(\mathbf{y}_A \mid \mathbf{f})$ is
an additive (modular) function in $A$. Subtracting
a modular function preserves submodularity, thus
$F(A)$ is submodular. Furthermore, the information
gain is monotonic in $A$ (i.e., $F(A) \le F(B)$ whenever
$A \subseteq B$) (Cover & Thomas, 1991). Thus, we can
apply the result of Nemhauser et al. (1978)^ which
guarantees that $\tilde\gamma_T$ is upper-bounded by $1/(1 - 1/e)$
times the value the greedy maximization algorithm
attains. The latter chooses features of the form
$v_t = \delta_{x_t} = [I\{x = x_t\}]$ in each round, $x_t \in D_T$. We
upper-bound the greedy maximum once more by
relaxing these constraints to $\|v_t\| = 1$ only. In the
remainder of the proof, we concentrate on this relaxed
greedy procedure. Suppose that up to round $t$, it chose
$v_1, \dots, v_{t-1}$. The posterior $P(\mathbf{f}|\mathbf{y}_{t-1})$ has inverse
covariance matrix $\Sigma_{t-1}^{-1} = K_{D_T}^{-1} + \sigma^{-2}V_{t-1}V_{t-1}^\top$,
$V_{t-1} = [v_1 \dots v_{t-1}]$, and the greedy procedure
selects $v$ so as to maximize the variance $v^\top\Sigma_{t-1}v$: the
eigenvector corresponding to $\Sigma_{t-1}$'s largest eigenvalue
(by the Rayleigh-Ritz theorem). Since $\Sigma_0 = K_{D_T}$,
then $v_1 = u_1$. Moreover, if all $v_{t'}$, $t' < t$, have
been chosen among $U$'s columns, then by the inverse
covariance expression just given, $K_{D_T}$ and $\Sigma_{t-1}$ have
the same eigenvectors, so that $v_t$ is a column of $U$ as
well. For example, if $v_t = u_j$, then comparing $\Sigma_{t-1}$
and $\Sigma_t$, all eigenvalues other than the $j$-th remain
the same, while the latter is shrunk. Therefore,
after $T$ rounds of the relaxed greedy procedure:
$v_t \in \{u_1, \dots, u_{\min\{T, n_T\}}\}$, $t = 1, \dots, T$: at most the
leading $T$ eigenvectors of $K_{D_T}$ can have been selected
(possibly multiple times). If $m_t$ denotes the number
of times that the $t$-th column of $U$ has been selected, we
obtain the theorem statement by a final bounding step. ■
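The bound of Lemma 7.6 can be evaluated numerically and compared with the information gain attained by greedy selection on a grid. This is a sketch under assumptions made for illustration: a squared exponential kernel on a 100-point grid, $T = 20$, and the hypothetical helper names `relaxed_bound` and `greedy_info_gain`. Because each term $\log(1 + \sigma^{-2}m_t\hat\lambda_t)$ is concave in $m_t$, assigning the $T$ units one at a time to the largest marginal increase solves the integer maximization exactly.

```python
import numpy as np

def relaxed_bound(eigvals, T, noise_var):
    """(1/2)/(1 - 1/e) * max_m sum_t log(1 + m_t lam_t / sigma^2), sum_t m_t = T."""
    lam = np.clip(np.sort(eigvals)[::-1][:T], 0.0, None)  # leading T eigenvalues
    m = np.zeros(len(lam), dtype=int)
    for _ in range(T):
        # Marginal gain of one more unit on each eigen-direction.
        gains = np.log1p((m + 1) * lam / noise_var) - np.log1p(m * lam / noise_var)
        m[int(np.argmax(gains))] += 1
    value = np.sum(np.log1p(m * lam / noise_var))
    return 0.5 / (1.0 - np.exp(-1.0)) * value, m

def greedy_info_gain(K, T, noise_var):
    """Information gain of greedily picking the max-variance grid point T times."""
    cov = K.copy()
    total = 0.0
    for _ in range(T):
        i = int(np.argmax(np.diag(cov)))
        var = cov[i, i]
        total += 0.5 * np.log1p(var / noise_var)
        # Rank-one posterior covariance update after a noisy observation at i.
        cov = cov - np.outer(cov[:, i], cov[i, :]) / (var + noise_var)
    return total

grid = np.linspace(0.0, 1.0, 100)
K = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / 0.2) ** 2)
T, noise_var = 20, 0.025
bound, m = relaxed_bound(np.linalg.eigvalsh(K), T, noise_var)
gain = greedy_info_gain(K, T, noise_var)
print("greedy gain:", gain, "relaxed bound:", bound, "allocation:", m)
```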



C.2. From Empirical to Process Eigenvalues

The final step will be to relate the empirical spectrum
$\{\hat\lambda_t\}$ to the kernel operator spectrum. Since
$\log(1 + \sigma^{-2}m_t\hat\lambda_t) \le \sigma^{-2}m_t\hat\lambda_t$ in Lemma 7.6, we will
mainly be interested in relating the tail sums of the
spectra. Let $\mu(x) = V(D)^{-1}I\{x \in D\}$ be the uniform
distribution on $D$, $V(D) = \int_{x \in D} dx$, and assume that
$k$ is continuous. Note that $\int k(x, x)\mu(x)\,dx = 1$ by
our assumption $k(x, x) = 1$, so that $k$ is Hilbert-Schmidt
on $L_2(\mu)$. Then, Mercer's theorem (Wahba,
1990) states that the corresponding kernel operator
has a discrete eigenspectrum $\{(\lambda_s, \phi_s(\cdot))\}$, and

$$k(x, x') = \sum_{s \ge 1} \lambda_s\phi_s(x)\phi_s(x'),$$

where $\lambda_1 \ge \lambda_2 \ge \dots \ge 0$, and $E_\mu[\phi_s(x)\phi_t(x)] =
\delta_{s,t}$. Moreover, $\sum_{s \ge 1} \lambda_s^2 < \infty$, and the expansion
of $k$ converges absolutely and uniformly on $D \times
D$. Note that $\sum_{s \ge 1} \lambda_s = \sum_{s \ge 1} \lambda_sE_\mu[\phi_s(x)^2] =
\int K(x, x)\mu(x)\,dx = 1$. In order to proceed from
Lemma 7.6, we have to pick a discretization $D_T$ for which
(13) holds, and for which $n_T^{-1}\sum_{t > T_*} \hat\lambda_t$ is not much larger
than $\sum_{t > T_*} \lambda_t$. With the following lemma, we determine
sizes $n_T$ for which such discretizations exist.

^While the result of Nemhauser et al. (1978) is stated
in terms of finite sets, it extends to infinite sets as long as
the greedy selection can be implemented efficiently.

Lemma 7.7 Fix $T \in \mathbb{N}$, $\delta > 0$ and $\varepsilon > 0$. There
exists a discretization $D_T \subset D$ of size

$$n_T = V(D)(\varepsilon/\sqrt{d})^{-d}[\log(1/\delta) + d\log(\sqrt{d}/\varepsilon) + \log V(D)]$$

which fulfils the following requirements:

• $\varepsilon$-denseness: For any $x \in D$, there exists $[x]_T \in
D_T$ such that $\|x - [x]_T\| \le \varepsilon$.

• If $\mathrm{spec}(K_{D_T}) = \{\hat\lambda_1 \ge \hat\lambda_2 \ge \dots\}$, then for any
$T_* = 1, \dots, n_T$: $n_T^{-1}\sum_{t=1}^{T_*}\hat\lambda_t \ge \sum_{t=1}^{T_*}\lambda_t - \delta$.

Proof First, if we draw $n_T$ samples $\tilde x_j \sim \mu(x)$
independently at random, then $D_T = \{\tilde x_j\}$ is $\varepsilon$-dense
with probability $\ge 1 - \delta$. Namely, cover $D$ with
$N = V(D)(\varepsilon/\sqrt{d})^{-d}$ hypercubes of sidelength $\varepsilon/\sqrt{d}$,
within which the maximum Euclidean distance is $\varepsilon$.
The probability of not hitting at least one cell is
upper-bounded by $N(1 - 1/N)^{n_T}$. Since $\log(1 - 1/N) \le
-1/N$, this is upper-bounded by $\delta$ if $n_T \ge N\log(N/\delta)$.

Now, let $S = n_T^{-1}\sum_{t=1}^{T_*}\hat\lambda_t$. Shawe-Taylor et al.
(2005) show that $E[S] \ge \sum_{t=1}^{T_*}\lambda_t$. If $C$ is the
event $\{D_T$ is $\varepsilon$-dense$\}$, then $\Pr(C) \ge 1 - \delta$. Since
$S \le n_T^{-1}\,\mathrm{tr}\,K_{D_T} = 1$ in any case, we have that
$E[S|C] \ge E[S] - \Pr(C^c) \ge \sum_{t=1}^{T_*}\lambda_t - \delta$. By the
probabilistic method, there must exist some $D_T$ for
which $C$ and the latter inequality holds. ■
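The size $n_T$ in Lemma 7.7 is exactly $N\log(N/\delta)$ for the cover size $N = V(D)(\varepsilon/\sqrt{d})^{-d}$, and the coverage argument can be checked with a few lines of arithmetic. The sketch below is illustrative; the helper names and the values $V(D) = 1$, $\varepsilon = 0.1$, $d = 2$, $\delta = 0.05$ are assumptions chosen for this example.

```python
import math

def cover_cells(volume, eps, d):
    """Number N of hypercubes of sidelength eps / sqrt(d) needed to cover D."""
    return volume * (eps / math.sqrt(d)) ** (-d)

def discretization_size(volume, eps, d, delta):
    """n_T = N [log(1/delta) + d log(sqrt(d)/eps) + log V(D)] from Lemma 7.7."""
    N = cover_cells(volume, eps, d)
    return N * (math.log(1.0 / delta) + d * math.log(math.sqrt(d) / eps)
                + math.log(volume))

def miss_probability_bound(N, n):
    """Union bound N (1 - 1/N)^n on the event that some cell receives no sample."""
    return N * (1.0 - 1.0 / N) ** n

volume, eps, d, delta = 1.0, 0.1, 2, 0.05
N = cover_cells(volume, eps, d)
n_T = discretization_size(volume, eps, d, delta)
miss = miss_probability_bound(N, n_T)
print("N =", N, "n_T =", n_T, "miss bound =", miss, "<=", delta)
```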

The following lemma, the equivalent of Theorem 4 in 
the context here, is a direct consequence of Lemma 7.6. 

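Lemma 7.8 Let $D_T$ be some discretization of $D$,
$n_T = |D_T|$. Then, for any $T_* = 1, \dots, \min\{T, n_T\}$:

$$\tilde\gamma_T \le \frac{1/2}{1 - e^{-1}}\max_{r = 1, \dots, T}\Big(T_*\log(rn_T/\sigma^2) + (T - r)\sigma^{-2}\sum_{t = T_*+1}^{n_T}\hat\lambda_t\Big).$$

Proof We split the right hand side in Lemma 7.6
at $t = T_*$. Let $r = \sum_{t \le T_*} m_t$. For $t \le T_*$:
$\log(1 + m_t\hat\lambda_t/\sigma^2) \le \log(rn_T/\sigma^2)$, since $\hat\lambda_t \le n_T$. For
$t > T_*$: $\log(1 + m_t\hat\lambda_t/\sigma^2) \le m_t\hat\lambda_t/\sigma^2 \le (T - r)\hat\lambda_t/\sigma^2$. ■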

The following theorem describes our "recipe" for obtaining
bounds on $\gamma_T$ for a particular kernel $k$, given
that tail bounds on $B_k(T_*) = \sum_{s > T_*}\lambda_s$ are known.

Theorem 8 Suppose that $D \subset \mathbb{R}^d$ is compact, and
$k(x, x')$ is a covariance function for which the additional
assumption of Theorem 2 holds. Moreover,
let $B_k(T_*) = \sum_{s > T_*}\lambda_s$, where $\{\lambda_s\}$ is the operator
spectrum of $k$ with respect to the uniform distribution
over $D$. Pick $\tau > 0$, and let $n_T = C_4T^\tau(\log T)$ with
$C_4 = 2V(D)(2\tau + 1)$. Then, the following bound holds
true:

$$\gamma_T \le \frac{1/2}{1 - e^{-1}}\max_{r = 1, \dots, T}\Big(T_*\log(rn_T/\sigma^2) + C_4\sigma^{-2}(1 - r/T)(\log T)(T^{\tau+1}B_k(T_*) + 1)\Big) + O(T^{1-\tau/d})$$

for any $T_* \in \{1, \dots, n_T\}$.

Proof Let $\varepsilon = d^{1/2}T^{-\tau/d}$ and $\delta = T^{-(\tau+1)}$.
Lemma 7.7 provides the existence of a discretization
$D_T$ of size $n_T$ which is $\varepsilon$-dense,
and for which $n_T^{-1}\sum_{t=1}^{T_*}\hat\lambda_t \ge \sum_{t=1}^{T_*}\lambda_t - \delta$.
Since $n_T^{-1}\sum_{t=1}^{n_T}\hat\lambda_t = 1 = \sum_{t \ge 1}\lambda_t$, then
$n_T^{-1}\sum_{t > T_*}\hat\lambda_t \le B_k(T_*) + \delta$. The statement follows
by using Lemma 7.8 with these bounds, and finally
employing Lemma 7.5. ■



C.3. Proof of Theorem 5 

In this section, we instantiate Theorem 8 in order to
obtain bounds on $\gamma_T$ for Squared Exponential and
Matérn kernels, results which are summarized in Theorem 5.

Squared Exponential Kernel

For the Squared Exponential kernel $k$, $B_k(T_*)$ is given
by Seeger et al. (2008). While $\mu(x)$ was Gaussian
there, the same decay rate holds for $\lambda_s$ w.r.t. uniform
$\mu(x)$, while constants might change. In hindsight, it
turns out that $\tau = d$ is the optimal choice for the
discretization size, rendering the second term in Theorem 8
to be $O(1)$, which is subdominant and will be
neglected in the sequel. We have that $\lambda_s \le cB^{s^{1/d}}$
with $B < 1$. Following their analysis,

$$B_k(T_*) \le c(d!)\alpha^{-d}e^{-\beta}\sum_{j=0}^{d-1}(j!)^{-1}\beta^j,$$

where $\alpha = -\log B$, $\beta = \alpha T_*^{1/d}$. Therefore, $B_k(T_*) =
O(e^{-\beta}\beta^{d-1})$.
We have to pick $T_*$ such that $e^{-\beta}$ is not much larger
than $(Tn_T)^{-1}$. Suppose that $T_* = [\log(Tn_T)/\alpha]^d$, so
that $e^{-\beta} = (Tn_T)^{-1}$, $\beta = \log(Tn_T)$. The bound becomes

$$\max_{r = 1, \dots, T}\Big(T_*\log(rn_T/\sigma^2) + \sigma^{-2}(1 - r/T)(C_5\beta^{d-1} + C_4(\log T))\Big)$$

with $n_T = C_4T^d(\log T)$. The first part dominates,
so that $r = T$ and $\gamma_T = O([\log(T^{d+1}(\log T))]^{d+1}) =
O((\log T)^{d+1})$. This should be compared with
$E[I(\mathbf{y}_T; \mathbf{f}_T)] = O((\log T)^{d+1})$ given by Seeger et al.
(2008), where the $x_t$ are drawn independently from
a Gaussian base distribution. At least restricted to
a compact set $D$, we obtain the same expression to
leading order for $\max_{\{x_t\}} I(\mathbf{y}_T; \mathbf{f}_T)$.
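The tail bound on $B_k(T_*)$ used above can be recovered by comparing the tail sum with an integral. The following is a sketch under the stated eigenvalue decay $\lambda_s \le cB^{s^{1/d}}$; the substitution variable $u$ is introduced here for illustration only, and the closed form of the incomplete gamma integral for integer $d$ is the standard identity.

```latex
\begin{align*}
B_k(T_*) = \sum_{s > T_*} \lambda_s
  &\le c \int_{T_*}^{\infty} \exp(-\alpha s^{1/d})\, ds
     && (\alpha = -\log B,\ \text{integrand decreasing}) \\
  &= c\, d\, \alpha^{-d} \int_{\beta}^{\infty} e^{-u} u^{d-1}\, du
     && (u = \alpha s^{1/d},\ \beta = \alpha T_*^{1/d}) \\
  &= c\,(d!)\,\alpha^{-d} e^{-\beta} \sum_{j=0}^{d-1} \frac{\beta^j}{j!}
   = O\big(e^{-\beta}\beta^{d-1}\big) \quad (\beta \ge 1).
\end{align*}
```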

Matérn Kernels

For Matérn kernels $k$ with roughness parameter $\nu$,
$B_k(T_*)$ is given by Seeger et al. (2008) for the uniform
base distribution $\mu(x)$ on $D$. Namely, $\lambda_s \le
cs^{-(2\nu+d)/d}$ for almost all $s \in \mathbb{N}$, and $B_k(T_*) =
O(T_*^{1-(2\nu+d)/d})$. To match terms in the $\gamma_T$ bound,
we choose $T_* = (Tn_T)^{d/(2\nu+d)}(\log(Tn_T))^\kappa$ ($\kappa$ chosen
below), so that the bound becomes

$$\max_{r = 1, \dots, T}\Big(T_*\log(rn_T/\sigma^2) + \sigma^{-2}(1 - r/T)\big(C_5T_*(\log(Tn_T))^{-\kappa(2\nu+d)/d} + C_4(\log T)\big)\Big) + O(T^{1-\tau/d})$$

with $n_T = C_4T^\tau(\log T)$. For $\kappa = -d/(2\nu + d)$, we obtain
that the maximum over $r$ is $O(T_*\log(Tn_T)) =
O(T^{(\tau+1)d/(2\nu+d)}(\log T))$. Finally, we choose $\tau =
2\nu d/(2\nu + d(d+1))$ to match this term with $O(T^{1-\tau/d})$.
Plugging this in, we have $\gamma_T = O(T^{1-2\eta}(\log T))$,
$\eta = \frac{\nu}{2\nu + d(d+1)}$. Together with Theorem 2 (for $\nu > 2$),
we have that $R_T = O^*(T^{1-\eta})$ (suppressing log factors):
for any $\nu > 2$ and any dimension $d$, the GP-UCB
algorithm is guaranteed to be no-regret in this
case with arbitrarily high probability.
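The exponent bookkeeping for the Matérn case is easy to get wrong, so it is worth checking with exact rational arithmetic that the chosen $\tau$ indeed balances the two terms. The helper `matern_exponents` below is a hypothetical name used for illustration; it only evaluates the formulas stated above.

```python
from fractions import Fraction

def matern_exponents(nu, d):
    """Return (tau, eta, gain_exp, slack_exp) for roughness nu and dimension d."""
    nu, d = Fraction(nu), Fraction(d)
    tau = 2 * nu * d / (2 * nu + d * (d + 1))
    eta = nu / (2 * nu + d * (d + 1))
    gain_exp = (tau + 1) * d / (2 * nu + d)  # exponent of T in T_* log(T n_T)
    slack_exp = 1 - tau / d                  # exponent of T in O(T^{1 - tau/d})
    return tau, eta, gain_exp, slack_exp

for nu, d in [(Fraction(5, 2), 1), (Fraction(5, 2), 2), (Fraction(5, 2), 10)]:
    tau, eta, gain_exp, slack_exp = matern_exponents(nu, d)
    # Both terms match and equal 1 - 2 * eta; the regret exponent is 1 - eta < 1.
    print(f"nu={nu}, d={d}: tau={tau}, gamma_T exponent={gain_exp}, "
          f"regret exponent={1 - eta}")
```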

How does this bound compare to the bound on
$E[I(\mathbf{y}_T; \mathbf{f}_T)]$ given by Seeger et al. (2008)? Here, $\gamma_T =
O(T^{d(d+1)/(2\nu+d(d+1))}(\log T))$, while $E[I(\mathbf{y}_T; \mathbf{f}_T)] =
O(T^{d/(2\nu+d)}(\log T)^{2\nu/(2\nu+d)})$.

Linear Kernel

For linear kernels $k(x, x') = x^\top x'$ with $\|x\| \le
1$, we can bound $\gamma_T$ directly. Let $X_T = [x_1 \dots x_T] \in
\mathbb{R}^{d \times T}$ with all $\|x_t\| \le 1$. Now,

$$\log|I + \sigma^{-2}X_T^\top X_T| = \log|I + \sigma^{-2}X_TX_T^\top| \le \log|I + \sigma^{-2}D|$$

with $D = \mathrm{diag}\,\mathrm{diag}^{-1}(X_TX_T^\top)$, by Hadamard's inequality.
The largest eigenvalue $\hat\lambda_1$ of $X_TX_T^\top$ is $O(T)$,
so that

$$\log|I + \sigma^{-2}X_T^\top X_T| \le d\log(1 + \sigma^{-2}\hat\lambda_1),$$

and $\gamma_T = O(d\log T)$.
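The chain of inequalities for the linear kernel can be checked on random data. The sketch below is illustrative only; the dimensions, noise level, and random inputs are assumptions made for this example. It verifies the determinant identity between the $T \times T$ and $d \times d$ forms, Hadamard's inequality, and the final bound in terms of the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, noise_var = 3, 500, 0.5
X = rng.normal(size=(d, T))
X /= np.maximum(1.0, np.linalg.norm(X, axis=0))  # enforce ||x_t|| <= 1

gram_d = X @ X.T  # d x d matrix X_T X_T^T
logdet_T = np.linalg.slogdet(np.eye(T) + X.T @ X / noise_var)[1]
logdet_d = np.linalg.slogdet(np.eye(d) + gram_d / noise_var)[1]
# Hadamard: the determinant of a PSD matrix is at most the product of its diagonal.
hadamard = np.sum(np.log1p(np.diag(gram_d) / noise_var))
lam_max = np.linalg.eigvalsh(gram_d)[-1]
upper = d * np.log1p(lam_max / noise_var)

print("log det (T x T):", logdet_T)
print("log det (d x d):", logdet_d)
print("Hadamard bound: ", hadamard)
print("d log(1 + lam_1 / sigma^2):", upper, "with lam_1 =", lam_max, "<= T =", T)
```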



